This repository contains the code and experiments for my PhD work on robust sentence embeddings for user-generated content (UGC), focusing on aligning standard and non-standard language in a shared semantic space.
It covers Experiment V (RoLASER) and Experiment VI (RoSONAR) from my thesis Chapter 8: Towards Translating UGC with Robust Sentence Embeddings.
Read the full thesis here: [Robust Neural Machine Translation of User-Generated Content](https://theses.hal.science/tel-05448644).
- Repository Structure
- Motivation
- Core Idea
- Experiments
- Citation
- Author
- Notes & Limitations
## Repository Structure

The repository is organised by model and experiment, with a clear separation between source code and SLURM job scripts:
```
├── src/
│   ├── rolaser/   # Preprocessing, training, and evaluation code for RoLASER
│   ├── rosonar/   # Preprocessing, training, and evaluation code for RoSONAR
│   └── mining/    # Exploratory scripts for bitext mining with RoLASER (standard ↔ UGC); unfinished
├── slurm/
│   ├── rolaser/   # SLURM scripts for RoLASER training and experiments
│   ├── rosonar/   # SLURM scripts for RoSONAR training and experiments
│   └── mining/    # SLURM scripts for RoLASER bitext mining experiments; unfinished
├── img/           # Figures used in the README
├── .gitignore
└── README.md
```

Note: The `src/mining/` and `slurm/mining/` directories contain scripts for exploratory bitext mining using RoLASER to align standard and UGC sentences. These experiments were left unfinished.
## Motivation

Most sentence encoders are trained on clean, standard text and degrade sharply when applied to UGC such as social media content, which exhibits:
- spelling and grammatical errors,
- slang, acronyms, and abbreviations,
- expressive typography (emojis, repetitions, leetspeak),
- tokenisation-breaking character-level perturbations.
## Core Idea

This work tackles robustness at the sentence level, framing UGC robustness as a bitext alignment problem in embedding space:

> How close are the embeddings of a standard sentence and its non-standard counterpart?
We propose training UGC-robust sentence encoders using:
- Knowledge distillation (teacher–student training),
- Synthetic UGC generation from standard text,
- Embedding alignment losses that minimise the distance between standard and non-standard variants.
Rather than normalising text explicitly, the model learns to abstract away surface-level variation.
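As a minimal sketch of this alignment objective (not the actual training code, which uses Fairseq; the toy vectors below are illustrative only): a frozen teacher embeds the standard sentence, the student embeds its non-standard variant, and training minimises the mean squared error between the two embeddings.

```python
import numpy as np

def mse_alignment_loss(teacher_emb: np.ndarray, student_emb: np.ndarray) -> float:
    """Mean squared error between teacher and student sentence embeddings."""
    diff = teacher_emb - student_emb
    return float(np.mean(diff ** 2))

# Toy 3-dimensional embeddings (real LASER embeddings are 1024-dimensional):
teacher = np.array([0.2, -0.5, 0.1])   # frozen teacher on the standard sentence
student = np.array([0.25, -0.4, 0.0])  # student on the non-standard variant
loss = mse_alignment_loss(teacher, student)  # driven towards 0 during training
```

Driving this loss to zero pulls the student's embedding of the non-standard sentence onto the teacher's embedding of its standard counterpart, without any explicit normalisation step.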
Approach: Because natural UGC data is scarce, we artificially generate non-standard English sentences from standard text. This allows us to train and evaluate models on a wide range of UGC phenomena without relying solely on limited real-world datasets.
We apply a set of probabilistic transformations to standard sentences, including:
- Abbreviations, acronyms, and slang
- Contractions and expansions
- Misspellings and typos
- Visual and segmentation perturbations
We also use a `mix_all` transformation, which randomly applies a subset of these perturbations in shuffled order, simulating realistic UGC variation.
These synthetic datasets enable controlled experimentation, allowing us to measure model robustness by UGC phenomenon type and address the lack of large-scale annotated non-standard text.
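For illustration, a drastically simplified pipeline of this kind might look as follows. The function names and substitution tables here are hypothetical stand-ins; the real transformations live in the data-preparation repository.

```python
import random

# Hypothetical, heavily simplified stand-ins for the real transformations:
def abbreviate(s, rng):
    subs = {"you": "u", "see": "c", "tomorrow": "tmrw", "are": "r"}
    return " ".join(subs.get(w.lower(), w) for w in s.split())

def repeat_chars(s, rng, p=0.2):
    # expressive lengthening, e.g. "so" -> "sooo"
    return "".join(ch * rng.randint(2, 3) if ch.isalpha() and rng.random() < p else ch
                   for ch in s)

def drop_spaces(s, rng, p=0.3):
    # segmentation perturbation: randomly delete whitespace
    return "".join(ch for ch in s if not (ch == " " and rng.random() < p))

TRANSFORMS = [abbreviate, repeat_chars, drop_spaces]

def mix_all(sentence, n=2, seed=0):
    """Apply a random subset of n transformations in shuffled order."""
    rng = random.Random(seed)
    for transform in rng.sample(TRANSFORMS, n):
        sentence = transform(sentence, rng)
    return sentence
```

Seeding the generator makes each perturbed corpus reproducible, which is what enables the controlled, per-phenomenon robustness measurements described above.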
Note: The scripts for synthetic UGC generation and data augmentation are available in a separate repository: https://github.com/lydianish-phd/data-preparation
## Experiments

### Experiment V: RoLASER

RoLASER is a Transformer-based student encoder trained to map non-standard English sentences close to their standard equivalents in the LASER embedding space.
RoLASER paper: [Making Sentence Embeddings Robust to User-Generated Content](https://aclanthology.org/2024.lrec-main.958), LREC-COLING 2024.
RoLASER demo repository: https://github.com/lydianish-phd/RoLASER
Note: The RoLASER repository linked above is the official demo released with the paper and is intended for demonstration purposes; this repository contains the full research code used in the thesis.
Variants (students):
| Model | Input type | Architecture |
|---|---|---|
| RoLASER | Token-level | RoBERTa-style Transformer |
| c-RoLASER | Character-level | CNN + Transformer (CharacterBERT) |
- Teacher: frozen LASER encoder
- Objective: Minimise MSE loss between teacher and student embeddings
- Data: 2M standard sentences from OSCAR, augmented with synthetic UGC phenomena (12 transformation types)
- Framework: Fairseq, multi-GPU training
- Metrics: cosine distance, xSIM / xSIM++
- Datasets: MultiLexNorm, ROCS-MT (natural UGC), FLORES (artificial UGC)
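The cosine-distance metric can be sketched as follows (the xSIM metrics come from the LASER toolkit and are not reproduced here; the pairwise-averaging helper is an illustrative assumption, not the exact evaluation script):

```python
import numpy as np

def cosine_distance(a, b):
    """1 - cosine similarity; 0 when the embeddings point in the same direction."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return 1.0 - float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def avg_pair_distance(std_embs, ugc_embs):
    """Average cosine distance between each standard sentence and its UGC variant."""
    return float(np.mean([cosine_distance(s, u) for s, u in zip(std_embs, ugc_embs)]))
```

A robust encoder should drive the average standard/UGC pair distance towards zero while keeping unrelated sentence pairs far apart.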
Downstream tasks:
- Sentence classification (e.g., TweetSentimentExtraction)
- Sentence pair classification (e.g., paraphrase detection)
- Semantic Textual Similarity (STS)
Main findings:
- RoLASER substantially improves robustness to synthetic and natural UGC
- Handles tokenisation-breaking perturbations better than LASER
- Maintains (or slightly improves) performance on standard text
- Token-level RoLASER outperforms character-aware c-RoLASER in most settings
- Character-level models are internally robust but struggle to map outputs to LASER space
### Experiment VI: RoSONAR

RoSONAR extends the RoLASER approach to machine translation. We first train a robust bilingual English–French sentence encoder aligned with SONAR, then pair it with a frozen multilingual SONAR decoder:
- Teacher: Multilingual SONAR encoder
- Student: Smaller bilingual encoder (half the layers of SONAR)
- Objective: Minimise MSE between teacher and student embeddings for standard and non-standard sentences
Architecture & Training:
- 12 Transformer layers, 16 attention heads, hidden size 1024, FFN 8192
- Tokeniser: SentencePiece, vocab size 256k
- Encoder parameters: 514M; combined with SONAR decoder: 1.643B
- Optimisation: AdamW, LR 7e-3, BF16 mixed precision, 16 H100 GPUs, effective batch size 1M tokens
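A back-of-the-envelope check (counting only the weight matrices; biases, layer norms, and positional parameters add comparatively little) shows these hyperparameters are consistent with the reported 514M encoder parameter count:

```python
d_model, d_ffn, n_layers, vocab = 1024, 8192, 12, 256_000

embedding = vocab * d_model                  # token embedding table: ~262M
attn_per_layer = 4 * d_model * d_model       # Q, K, V and output projections
ffn_per_layer = 2 * d_model * d_ffn          # up- and down-projections
total = embedding + n_layers * (attn_per_layer + ffn_per_layer)

print(f"{total / 1e6:.0f}M")  # -> 514M
```

Note that more than half of the encoder's parameters sit in the 256k-entry embedding table, a consequence of reusing SONAR's large multilingual vocabulary for a bilingual model.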
Data:
- Parallel English–French data from NLLB (≈329M sentence pairs)
- Monolingual English/French data from OSCAR (36M French, 24M English sentences after decontamination)
- Synthetic English UGC is generated via 19 probabilistic transformations (character and word-level perturbations)
- Interleaved batches from parallel, monolingual, and synthetic sets; 100k sentences held out for validation.
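The interleaving scheme can be sketched as a simple round-robin over the three batch streams. This is an illustrative assumption about the scheduling; the actual training uses fairseq2 data pipelines, and the stream names below are placeholders.

```python
def interleave(*sources):
    """Round-robin over several batch iterators, stopping when any is exhausted."""
    iterators = [iter(s) for s in sources]
    while True:
        for it in iterators:
            try:
                yield next(it)
            except StopIteration:
                return

# Illustrative stand-ins for the three data streams:
parallel  = [("parallel", i) for i in range(3)]   # NLLB English-French batches
mono      = [("mono", i) for i in range(3)]       # OSCAR monolingual batches
synthetic = [("ugc", i) for i in range(3)]        # synthetic UGC batches

schedule = list(interleave(parallel, mono, synthetic))
# schedule alternates: parallel, mono, ugc, parallel, mono, ugc, ...
```

Round-robin interleaving keeps each objective represented at a steady rate within every stretch of training, so no single data source dominates the gradient signal.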
Framework: Transformers, Fairseq2, multi-node and multi-GPU training
Evaluation:
- Datasets: standard, synthetic, and natural UGC (MultiLexNorm, ROCS-MT, FLORES)
- Compared models: RoSONAR, RoSONAR-std (trained only on standard data), SONAR, NLLB
Main results:
- RoSONAR vs RoSONAR-std: nearly identical (<0.5 COMET points), showing limited impact of synthetic UGC at this scale
- Compared to baselines: both RoSONAR variants slightly underperform SONAR but remain close to NLLB; some gains on highly non-standard French (PFS-MB)
- Robustness to perturbations: large models degrade slowly under synthetic UGC; RoSONAR similar to SONAR with limited extra gains
- Cross-lingual transfer: robustness from English synthetic UGC does not reliably transfer to French
- Implication: encoder-level robustness alone is insufficient; natural UGC and domain adaptation are crucial
Key takeaway: Large-scale encoders already have inherent robustness to surface-level non-standardness. Synthetic UGC objectives show diminishing returns at scale; in-domain natural UGC is necessary for meaningful MT robustness.
## Citation

If you use the RoLASER model or ideas from this work, please cite the following paper:
@inproceedings{nishimwe-etal-2024-making-sentence,
title = "Making Sentence Embeddings Robust to User-Generated Content",
author = "Nishimwe, Lydia and
Sagot, Beno{\^\i}t and
Bawden, Rachel",
editor = "Calzolari, Nicoletta and
Kan, Min-Yen and
Hoste, Veronique and
Lenci, Alessandro and
Sakti, Sakriani and
Xue, Nianwen",
booktitle = "Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)",
month = may,
year = "2024",
address = "Torino, Italia",
publisher = "ELRA and ICCL",
url = "https://aclanthology.org/2024.lrec-main.958",
pages = "10984--10998"
}

If you use the RoSONAR model or ideas from this work, please cite the PhD thesis:
@phdthesis{nishimwe:tel-05448644,
TITLE = {{Robust Neural Machine Translation of User-Generated Content}},
AUTHOR = {Nishimwe, Lydia},
URL = {https://theses.hal.science/tel-05448644},
NUMBER = {2025SORUS369},
SCHOOL = {{Sorbonne Universit{\'e}}},
YEAR = {2025},
MONTH = Jun,
KEYWORDS = {Lexical normalisation ; User-Generated content ; Deep learning ; Robustness ; Non standard texts ; Machine translation ; Normalisation lexicale ; Contenus g{\'e}n{\'e}r{\'e}s par les utilisateurs ; Apprentissage profond ; Robustesse ; Textes non standards ; Traduction automatique},
TYPE = {Theses},
PDF = {https://theses.hal.science/tel-05448644v1/file/144629_NISHIMWE_2025_archivage.pdf},
HAL_ID = {tel-05448644},
HAL_VERSION = {v1},
}

## Author

Lydia Nishimwe
PhD in Machine Translation & NLP
Focus: UGC robustness, sentence embeddings, multilingual NLP
- Personal GitHub: https://github.com/lydianish
- PhD organisation: https://github.com/lydianish-phd
## Notes & Limitations

- Synthetic UGC does not fully capture real-world UGC distributions
- Robust embeddings do not automatically guarantee robust MT
- Domain adaptation remains a key challenge
This repository should be viewed as a research artefact supporting the dissertation rather than a polished end-user library.



