Hands-on tutorials for curating text data with NeMo Curator. Each tutorial is a complete working example with detailed explanations.
New to text curation? Start with the Text Getting Started Guide for setup and basic concepts.
| Tutorial | Description | Files |
|---|---|---|
| Download & Extract | Data acquisition workflows | download_extract_tutorial.ipynb |
| Deduplication | Remove duplicate content | Fuzzy and semantic deduplication notebooks |
| Classification | Quality assessment and categorization | quality-classification.ipynb, domain-classification.ipynb, fineweb-edu-classification.ipynb, and more |
| PEFT Curation | Instruction-tuning data preparation | main.py, stages.py |
| TinyStories | End-to-end processing pipeline | main.py, stages.py |
| Megatron Tokenizing | Tokenization pipeline that produces Megatron-LM-ready files | main.py |
| Llama Nemotron Data Curation | Data curation on the Llama Nemotron Post-Training Dataset | main.py and helper files |
| GLiNER-based PII Redaction | Redact personally identifiable information (PII) with NVIDIA's GLiNER-PII model | gliner_pii_redaction.ipynb |

| Category | Links |
|---|---|
| Concepts | Processing • Pipeline • Loading • Generation |
| Data Sources | Common Crawl • Wikipedia • ArXiv • Custom |
| Processing | Quality Assessment • Deduplication • Content Processing • PII Removal |
| Advanced | Distributed Classification • Semantic Dedup • GPU Dedup • Synthetic Data |
Documentation: Main Docs • API Reference • GitHub Discussions