lydianish-phd/robust-embeddings

Robust Sentence Embeddings for User-Generated Content (UGC)

This repository contains the code and experiments for my PhD work on robust sentence embeddings for user-generated content (UGC), focusing on aligning standard and non-standard language in a shared semantic space.
It covers Experiment V (RoLASER) and Experiment VI (RoSONAR) from my thesis Chapter 8: Towards Translating UGC with Robust Sentence Embeddings.

πŸŽ“ Read the full thesis here: Robust Neural Machine Translation of User-Generated Content.


πŸ“‘ Table of Contents

  1. πŸ“ Repository Structure
  2. πŸ” Motivation
  3. 🧠 Core Idea
  4. πŸ§ͺ Experiments
  5. πŸ“„ Citation
  6. πŸ‘€ Author
  7. ⚠️ Notes & Limitations

πŸ“ Repository Structure

The repository is organised by model and experiment, with a clear separation between source code and SLURM job scripts:

β”œβ”€β”€ src/
β”‚   β”œβ”€β”€ rolaser/   # Preprocessing, training, and evaluation code for RoLASER
β”‚   β”œβ”€β”€ rosonar/   # Preprocessing, training, and evaluation code for RoSONAR
β”‚   └── mining/    # Experimental scripts for bitext mining using RoLASER (standard ↔ UGC); unfinished
β”œβ”€β”€ slurm/
β”‚   β”œβ”€β”€ rolaser/   # SLURM scripts for RoLASER training and experiments
β”‚   β”œβ”€β”€ rosonar/   # SLURM scripts for RoSONAR training and experiments
β”‚   └── mining/    # SLURM scripts for RoLASER bitext mining experiments; unfinished
β”œβ”€β”€ img/           # Figures used in the README
β”œβ”€β”€ .gitignore
└── README.md

Note: The src/mining/ and slurm/mining/ directories contain scripts for exploratory bitext mining using RoLASER to align standard and UGC sentences. These experiments were unfinished.


πŸ” Motivation

Most sentence encoders are trained on clean, standard text and degrade sharply when applied to UGC such as social media content, which exhibits:

  • spelling and grammatical errors,
  • slang, acronyms, and abbreviations,
  • expressive typography (emojis, repetitions, leetspeak),
  • tokenisation-breaking character-level perturbations.

This work tackles robustness at the sentence level, framing UGC robustness as a bitext alignment problem in embedding space:

How close are the embeddings of a standard sentence and its non-standard counterpart?
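This question can be made concrete with cosine distance between the two embeddings. A minimal sketch (the vectors below are toy stand-ins for encoder outputs, not real LASER embeddings):

```python
import numpy as np

def cosine_distance(u, v):
    """Cosine distance between two embedding vectors (0 = same direction)."""
    u, v = np.asarray(u, dtype=float), np.asarray(v, dtype=float)
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Toy stand-ins for the embeddings of a standard/non-standard pair:
std_emb = np.array([0.9, 0.1, 0.4])  # e.g. "see you tomorrow"
ugc_emb = np.array([0.8, 0.2, 0.5])  # e.g. "c u tmrw"

distance = cosine_distance(std_emb, ugc_emb)  # small = well-aligned pair
```

A robust encoder should keep this distance small for such pairs while preserving the separation between semantically different sentences.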


🧠 Core Idea

We propose training UGC-robust sentence encoders using:

  • Knowledge distillation (teacher–student training),
  • Synthetic UGC generation from standard text,
  • Embedding alignment losses that minimise the distance between standard and non-standard variants.

Rather than normalising text explicitly, the model learns to abstract away surface-level variation.
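The alignment objective can be sketched numerically as follows. This is a toy illustration: linear maps stand in for the real Transformer teacher and student, and all names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

W_teacher = rng.normal(size=(4, 3))  # frozen "teacher" encoder
W_student = rng.normal(size=(4, 3))  # trainable "student" encoder

def alignment_loss(W_s, x_std, x_ugc):
    """MSE between the teacher's embedding of the STANDARD input and the
    student's embedding of the NON-STANDARD variant."""
    target = W_teacher @ x_std   # teacher sees clean text
    output = W_s @ x_ugc         # student sees the UGC variant
    return np.mean((output - target) ** 2)

def grad(W_s, x_std, x_ugc):
    # Analytical gradient of the MSE above with respect to W_s.
    residual = W_s @ x_ugc - W_teacher @ x_std
    return (2.0 / residual.size) * np.outer(residual, x_ugc)

x_std = rng.normal(size=3)
x_ugc = x_std + 0.1 * rng.normal(size=3)  # "noisy" non-standard variant

initial = alignment_loss(W_student, x_std, x_ugc)
for _ in range(200):  # plain gradient descent; only the student is updated
    W_student -= 0.1 * grad(W_student, x_std, x_ugc)
final = alignment_loss(W_student, x_std, x_ugc)
```

Because only the student is updated, the (frozen) teacher's embedding space is preserved, and the student learns to project noisy inputs into it.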


πŸ§ͺ Experiments

🧩 Synthetic UGC Generation

Figure: overview of the data augmentation technique

Approach: Natural UGC data is scarce, so we artificially generate non-standard English sentences from standard text. This lets us train and evaluate models on a wide range of UGC phenomena without relying solely on limited real-world datasets.

We apply a set of probabilistic transformations to standard sentences, including:

  • Abbreviations, acronyms, and slang
  • Contractions and expansions
  • Misspellings and typos
  • Visual and segmentation perturbations

We also use a mix_all transformation, which randomly applies a subset of these perturbations in shuffled order, simulating realistic UGC variation.

These synthetic datasets enable controlled experimentation, allowing us to measure model robustness by UGC phenomenon type and address the lack of large-scale annotated non-standard text.
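A minimal sketch of the mix_all idea follows. The three toy perturbations are placeholders of my own, not the actual transformation set (which lives in the data-preparation repository linked below):

```python
import random

# Toy perturbations standing in for the real transformation set:
def abbreviate(s): return s.replace("you", "u").replace("tomorrow", "tmrw")
def stretch(s):    return s + s[-1] * 2 if s else s  # expressive repetition
def lowercase(s):  return s.lower()

PERTURBATIONS = [abbreviate, stretch, lowercase]

def mix_all(sentence, p=0.5, rng=None):
    """Apply a random subset of perturbations, in shuffled order."""
    rng = rng or random.Random()
    chosen = [f for f in PERTURBATIONS if rng.random() < p]
    rng.shuffle(chosen)
    for f in chosen:
        sentence = f(sentence)
    return sentence

noisy = mix_all("See you tomorrow", p=1.0, rng=random.Random(42))
```

Seeding the generator makes a given perturbation sequence reproducible, which matters for controlled robustness experiments.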

Note: The scripts for synthetic UGC generation and data augmentation are available in a separate repository: https://github.com/lydianish-phd/data-preparation

🧩 RoLASER

RoLASER is a Transformer-based student encoder trained to map non-standard English sentences close to their standard equivalents in the LASER embedding space.

Figure: RoLASER teacher–student distillation

πŸ”— RoLASER Paper: Making Sentence Embeddings Robust to User-Generated Content, LREC-COLING 2024.

πŸ”— RoLASER Demo GitHub repository: https://github.com/lydianish-phd/RoLASER

Note: The separate RoLASER GitHub repo linked above is the official demo released with the paper and is intended for demonstration purposes, while this repository contains the full research code used in the thesis.

πŸ”§ Experimental Setup

Variants (students):

Model      Input type       Architecture
RoLASER    Token-level      RoBERTa-style Transformer
c-RoLASER  Character-level  CNN + Transformer (CharacterBERT)

  • Teacher: frozen LASER encoder
  • Objective: Minimise MSE loss between teacher and student embeddings
  • Data: 2M standard sentences from OSCAR, augmented with synthetic UGC phenomena (12 transformation types)
  • Framework: Fairseq, multi-GPU training

πŸ”¬ Evaluation & Findings

Metrics: cosine distance, xSIM / xSIM++
Datasets: MultiLexNorm and ROCS-MT (natural UGC), FLORES (artificial UGC)
Downstream tasks:

  1. Sentence classification (e.g., TweetSentimentExtraction)
  2. Sentence pair classification (e.g., paraphrase detection)
  3. Semantic Textual Similarity (STS)
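The retrieval flavour of these metrics can be sketched as a plain nearest-neighbour error rate. Note this is a simplification of my own: the real xSIM/xSIM++ use margin-based scoring rather than raw cosine similarity.

```python
import numpy as np

def retrieval_error_rate(src_emb, tgt_emb):
    """For each source embedding, retrieve the most cosine-similar target;
    an error is any source whose nearest target is not its true counterpart
    (pairs are assumed to be aligned by row index)."""
    src = src_emb / np.linalg.norm(src_emb, axis=1, keepdims=True)
    tgt = tgt_emb / np.linalg.norm(tgt_emb, axis=1, keepdims=True)
    nearest = np.argmax(src @ tgt.T, axis=1)
    return float(np.mean(nearest != np.arange(len(src))))

aligned    = retrieval_error_rate(np.eye(3), np.eye(3))                   # 0.0
misaligned = retrieval_error_rate(np.eye(3), np.roll(np.eye(3), 1, axis=0))
```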

Main findings:

  • RoLASER substantially improves robustness to synthetic and natural UGC
  • Handles tokenisation-breaking perturbations better than LASER
  • Maintains (or slightly improves) performance on standard text
  • Token-level RoLASER outperforms character-aware c-RoLASER in most settings
  • Character-level models are internally robust but struggle to map outputs to LASER space

🧩 RoSONAR

RoSONAR extends the RoLASER approach to machine translation. We first train a robust, bilingual English–French sentence encoder aligned with SONAR:

Figure: RoSONAR encoder–decoder

and pair it with a frozen multilingual SONAR decoder:

Figure: RoSONAR encoder–decoder

πŸ”§ Experimental Setup

  • Teacher: Multilingual SONAR encoder
  • Student: Smaller bilingual encoder (half the layers of SONAR)
  • Objective: Minimise MSE between teacher and student embeddings for standard and non-standard sentences

Architecture & Training:

  • 12 Transformer layers, 16 attention heads, hidden size 1024, FFN 8192
  • Tokeniser: SentencePiece, vocab size 256k
  • Encoder parameters: 514M; combined with SONAR decoder: 1.643B
  • Optimisation: AdamW, LR 7e-3, BF16 mixed precision, 16 H100 GPUs, effective batch size 1M tokens

Data:

  • Parallel English↔French data from NLLB (≈329M sentence pairs)
  • Monolingual English/French data from OSCAR (36M French and 24M English sentences after decontamination)
  • Synthetic English UGC generated via 19 probabilistic transformations (character- and word-level perturbations)
  • Interleaved batches from the parallel, monolingual, and synthetic sets; 100k sentences held out for validation

Framework: Transformers, Fairseq2, multi-node and multi-GPU training

πŸ”¬ Evaluation & Findings

  • Evaluated on standard, synthetic, and natural UGC datasets (MultiLexNorm, ROCS-MT, FLORES)
  • Compared models: RoSONAR, RoSONAR-std (trained only on standard data), SONAR, NLLB

Main results:

  1. RoSONAR vs RoSONAR-std: nearly identical (<0.5 COMET points), showing limited impact of synthetic UGC at this scale
  2. Compared to baselines: both RoSONAR variants slightly underperform SONAR but remain close to NLLB; some gains on highly non-standard French (PFS-MB)
  3. Robustness to perturbations: large models degrade slowly under synthetic UGC; RoSONAR similar to SONAR with limited extra gains
  4. Cross-lingual transfer: robustness from English synthetic UGC does not reliably transfer to French
  5. Implication: encoder-level robustness alone is insufficient; natural UGC and domain adaptation are crucial

Key takeaway: Large-scale encoders already have inherent robustness to surface-level non-standardness. Synthetic UGC objectives show diminishing returns at scale; in-domain natural UGC is necessary for meaningful MT robustness.


πŸ“„ Citation

If you use the RoLASER model or ideas from this work, please cite the following paper:

@inproceedings{nishimwe-etal-2024-making-sentence,
    title = "Making Sentence Embeddings Robust to User-Generated Content",
    author = "Nishimwe, Lydia  and
      Sagot, Beno{\^\i}t  and
      Bawden, Rachel",
    editor = "Calzolari, Nicoletta  and
      Kan, Min-Yen  and
      Hoste, Veronique  and
      Lenci, Alessandro  and
      Sakti, Sakriani  and
      Xue, Nianwen",
    booktitle = "Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)",
    month = may,
    year = "2024",
    address = "Torino, Italia",
    publisher = "ELRA and ICCL",
    url = "https://aclanthology.org/2024.lrec-main.958",
    pages = "10984--10998"
}

If you use the RoSONAR model or ideas from this work, please cite the PhD thesis:

@phdthesis{nishimwe:tel-05448644,
  TITLE = {{Robust Neural Machine Translation of User-Generated Content}},
  AUTHOR = {Nishimwe, Lydia},
  URL = {https://theses.hal.science/tel-05448644},
  NUMBER = {2025SORUS369},
  SCHOOL = {{Sorbonne Universit{\'e}}},
  YEAR = {2025},
  MONTH = jun,
  KEYWORDS = {Lexical normalisation ; User-Generated content ; Deep learning ; Robustness ; Non standard texts ; Machine translation ; Normalisation lexicale ; Contenus g{\'e}n{\'e}r{\'e}s par les utilisateurs ; Apprentissage profond ; Robustesse ; Textes non standards ; Traduction automatique},
  TYPE = {Theses},
  PDF = {https://theses.hal.science/tel-05448644v1/file/144629_NISHIMWE_2025_archivage.pdf},
  HAL_ID = {tel-05448644},
  HAL_VERSION = {v1},
}

πŸ‘€ Author

Lydia Nishimwe
PhD in Machine Translation & NLP
Focus: UGC robustness, sentence embeddings, multilingual NLP

πŸ”— Personal GitHub: https://github.com/lydianish
πŸ”— PhD organisation: https://github.com/lydianish-phd


⚠️ Notes & Limitations

  • Synthetic UGC does not fully capture real-world UGC distributions
  • Robust embeddings do not automatically guarantee robust MT
  • Domain adaptation remains a key challenge

This repository should be viewed as a research artefact supporting the dissertation rather than a polished end-user library.
