Text Curation Tutorials

Hands-on tutorials for curating text data with NeMo Curator. Each tutorial is a complete working example with detailed explanations.

Quick Start

New to text curation? Start with the Text Getting Started Guide for setup and basic concepts.

Available Tutorials

| Tutorial | Description | Files |
|----------|-------------|-------|
| Download & Extract | Data acquisition workflows | download_extract_tutorial.ipynb |
| Deduplication | Remove duplicate content | Fuzzy and semantic deduplication notebooks |
| Classification | Quality assessment and categorization | quality-classification.ipynb, domain-classification.ipynb, fineweb-edu-classification.ipynb, and more |
| PEFT Curation | Instruction-tuning data preparation | main.py, stages.py |
| TinyStories | End-to-end processing pipeline | main.py, stages.py |
| Megatron Tokenizing | Tokenization pipeline that produces Megatron-LM ready files | main.py |
| Llama Nemotron Data Curation | Data curation on the Llama Nemotron Post-Training Dataset | main.py and helper files |
| GLiNER-based PII Redaction | Redacting personally identifiable information with NVIDIA's GLiNER-PII model | gliner_pii_redaction.ipynb |

Documentation Links

| Category | Links |
|----------|-------|
| Concepts | Processing, Pipeline, Loading, Generation |
| Data Sources | Common Crawl, Wikipedia, ArXiv, Custom |
| Processing | Quality Assessment, Deduplication, Content Processing, PII Removal |
| Advanced | Distributed Classification, Semantic Dedup, GPU Dedup, Synthetic Data |

Support

Documentation: Main Docs, API Reference, GitHub Discussions