A Universal Dependencies corpus for spoken French.
The corpus was automatically converted from the Rhapsodie treebank and then underwent many manual corrections and improvements.
The corpus is maintained in the SUD format and is available in the SUD_French-Rhapsodie repository.
Prosodic annotations from the original project were imported into the SUD data in 2025. This work is described in the TLT paper:
Maria Paz Botero-Garcia, Emmett Strickland, Bruno Guillaume, Sylvain Kahane, and Anne Lacheret-Dujour. 2025. An intonosyntactic treebank for spoken French: What is new with Rhapsodie? In Proceedings of the 23rd International Workshop on Treebanks and Linguistic Theories (TLT, SyntaxFest 2025), pages 111–118, Ljubljana, Slovenia. Association for Computational Linguistics.
The richest annotations are available in the prosody_pauses folder in the SUD repository.
Several other versions are automatically built from it.
The table below outlines the various available formats and their production methods.
| Treebank | Description | Files | Production |
|---|---|---|---|
| SUD_French-Rhapsodie-prosody_pauses | SUD Syntax + Prosody (including pauses) | `prosody_pauses/*.conllu` | Source data |
| SUD_French-Rhapsodie-prosody | SUD Syntax + Prosody | `prosody/*.conllu` | `grs/remove_pauses.grs` |
| SUD_French-Rhapsodie@p_words | SUD Syntax (on phonological words) | `p_words/*.conllu` | `grs/remove_syllables.grs` |
| SUD_French-Rhapsodie@latest | SUD Syntax | `*.conllu` | `grs/split_amalgam.grs` |
| UD_French-Rhapsodie@conv | UD Syntax | `*.conllu` in UD repo | `fr_SUD_to_UD.grs` in converter |

Phonological words: amalgams (au or du) are not split into syntactic words (à+le or de+le) as expected in UD.
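
To make the amalgam split concrete, here is a minimal Python sketch of how a surface form such as au maps to a CoNLL-U multiword-token range followed by its syntactic words. It is only an illustration: the actual conversion is done by the graph rewriting rules in `grs/split_amalgam.grs`, which are not reproduced here. The `AMALGAMS` table and the `split_amalgams` function are hypothetical names, and the table covers only the two amalgams mentioned above.

```python
# Hypothetical lookup table: only the two amalgams named in this README.
AMALGAMS = {"au": ("à", "le"), "du": ("de", "le")}


def split_amalgams(forms):
    """Return CoNLL-U style (ID, FORM) pairs for a list of surface forms.

    An amalgam yields a multiword-token range line (e.g. "3-4 au")
    followed by one line per syntactic word; other forms get one line.
    This works on surface forms only and ignores all other annotation.
    """
    rows = []
    next_id = 1
    for form in forms:
        parts = AMALGAMS.get(form.lower())
        if parts is None:
            rows.append((str(next_id), form))
            next_id += 1
        else:
            last_id = next_id + len(parts) - 1
            rows.append((f"{next_id}-{last_id}", form))
            for part in parts:
                rows.append((str(next_id), part))
                next_id += 1
    return rows


if __name__ == "__main__":
    for token_id, form in split_amalgams(["je", "vais", "au", "marché"]):
        print(f"{token_id}\t{form}")
```

In this sketch au keeps its surface form on the range line `3-4`, while à and le receive the word IDs `3` and `4`, which is the layout UD uses for multiword tokens.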