Distributed Data Classification

This directory contains Jupyter notebook tutorials that demonstrate how to use the text classification models supported by NeMo Curator. The classifiers annotate text data, and those annotations are useful when blending datasets for foundation model training.

Each of these classifiers is available on Hugging Face and can be run independently with the Transformers library. Running them with NeMo Curator accelerates the classifiers through a heterogeneous pipeline in which tokenization runs across CPUs and model inference runs across GPUs. Each Jupyter notebook in this directory demonstrates how to run a classifier on text data, and the same approach scales to large datasets.
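To make the two-stage idea concrete, here is a minimal conceptual sketch in plain Python. It is not NeMo Curator's implementation and uses none of its APIs: `toy_tokenize`, `toy_model`, `batched`, and `classify` are hypothetical stand-ins, and a thread pool stands in for the CPU workers a real pipeline would use. It only illustrates the shape of the flow: tokenize in parallel, then run inference in batches.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Iterator, List, Sequence


def toy_tokenize(text: str) -> List[str]:
    """Hypothetical stand-in for a real tokenizer: lowercase and split on whitespace."""
    return text.lower().split()


def batched(items: Sequence, batch_size: int) -> Iterator[Sequence]:
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def toy_model(batch: Sequence[List[str]]) -> List[str]:
    """Hypothetical stand-in for batched GPU inference: label each document by token count."""
    return ["long" if len(tokens) > 5 else "short" for tokens in batch]


def classify(texts: Sequence[str], batch_size: int = 2, cpu_workers: int = 4) -> List[str]:
    # Stage 1: tokenization fanned out across workers.
    # Threads keep the sketch simple; a real pipeline would use separate CPU processes.
    with ThreadPoolExecutor(max_workers=cpu_workers) as pool:
        tokenized = list(pool.map(toy_tokenize, texts))  # map() preserves input order

    # Stage 2: inference over fixed-size batches, as a GPU model would consume them.
    labels: List[str] = []
    for batch in batched(tokenized, batch_size):
        labels.extend(toy_model(batch))
    return labels
```

Separating the stages this way lets each one be sized to the hardware it runs on: the number of tokenization workers can follow the CPU core count, while the batch size can follow the available GPU memory.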

Before running any of these notebooks, see the Installation Guide page for instructions on how to install NeMo Curator. Be sure to use an installation method that includes the GPU dependencies.

For more information about the classifiers, refer to our Distributed Data Classification documentation page.

List of Classifiers

| NeMo Curator Classifier | Hugging Face Page |
| --- | --- |
| AegisClassifier | nvidia/Aegis-AI-Content-Safety-LlamaGuard-Defensive-1.0 and nvidia/Aegis-AI-Content-Safety-LlamaGuard-Permissive-1.0 |
| ContentTypeClassifier | nvidia/content-type-classifier-deberta |
| DomainClassifier | nvidia/domain-classifier |
| FineWebEduClassifier | HuggingFaceFW/fineweb-edu-classifier |
| FineWebMixtralEduClassifier | nvidia/nemocurator-fineweb-mixtral-edu-classifier |
| FineWebNemotronEduClassifier | nvidia/nemocurator-fineweb-nemotron-4-edu-classifier |
| InstructionDataGuardClassifier | nvidia/instruction-data-guard |
| MultilingualDomainClassifier | nvidia/multilingual-domain-classifier |
| PromptTaskComplexityClassifier | nvidia/prompt-task-and-complexity-classifier |
| QualityClassifier | nvidia/quality-classifier-deberta |
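
For scripting against this list, the mapping can be kept as a small lookup table. The sketch below is only an illustration: `HF_MODELS` and `hub_urls` are hypothetical names, not part of NeMo Curator, and the `nvidia/` prefix on the quality classifier is assumed to match the other NVIDIA-hosted models. The URLs follow the standard `https://huggingface.co/<repo_id>` pattern.

```python
from typing import Dict, List

# Classifier class name -> Hugging Face repository IDs, as listed above.
HF_MODELS: Dict[str, List[str]] = {
    "AegisClassifier": [
        "nvidia/Aegis-AI-Content-Safety-LlamaGuard-Defensive-1.0",
        "nvidia/Aegis-AI-Content-Safety-LlamaGuard-Permissive-1.0",
    ],
    "ContentTypeClassifier": ["nvidia/content-type-classifier-deberta"],
    "DomainClassifier": ["nvidia/domain-classifier"],
    "FineWebEduClassifier": ["HuggingFaceFW/fineweb-edu-classifier"],
    "FineWebMixtralEduClassifier": ["nvidia/nemocurator-fineweb-mixtral-edu-classifier"],
    "FineWebNemotronEduClassifier": ["nvidia/nemocurator-fineweb-nemotron-4-edu-classifier"],
    "InstructionDataGuardClassifier": ["nvidia/instruction-data-guard"],
    "MultilingualDomainClassifier": ["nvidia/multilingual-domain-classifier"],
    "PromptTaskComplexityClassifier": ["nvidia/prompt-task-and-complexity-classifier"],
    # Assumption: hosted under the nvidia organization like the other NVIDIA models.
    "QualityClassifier": ["nvidia/quality-classifier-deberta"],
}


def hub_urls(classifier_name: str) -> List[str]:
    """Return the Hugging Face page URLs for a classifier; raises KeyError if unknown."""
    return [f"https://huggingface.co/{repo_id}" for repo_id in HF_MODELS[classifier_name]]
```

For example, `hub_urls("DomainClassifier")` returns a one-element list containing the page for `nvidia/domain-classifier`, while `hub_urls("AegisClassifier")` returns two URLs, one per Aegis variant.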