TinyStories

This tutorial shows how to use NeMo Curator's Python API to curate the TinyStories dataset. TinyStories is a dataset of short stories generated by GPT-3.5 and GPT-4 that use only words typically understood by 3- to 4-year-olds. Because the dataset is small, it is well suited to creating and validating data curation pipelines on a local machine.

For simplicity, this tutorial uses the validation split of this dataset, which contains around 22,000 samples.
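To give a feel for what a "sample" is here, the following is a minimal sketch of splitting raw text into individual stories. It assumes the raw validation file is plain text in which stories are separated by an `<|endoftext|>` marker; the helper name `split_stories` and the sample text are hypothetical and are not taken from the tutorial's code.

```python
def split_stories(raw_text, separator="<|endoftext|>"):
    """Split raw text into stories, dropping empty chunks.

    Assumption: stories are delimited by `separator` in the raw file.
    """
    chunks = (chunk.strip() for chunk in raw_text.split(separator))
    return [chunk for chunk in chunks if chunk]


# Hypothetical two-story sample in the assumed format.
sample = (
    "Once upon a time, there was a little dog.\n"
    "<|endoftext|>\n"
    "Tom had a red ball.\n"
    "<|endoftext|>\n"
)

stories = split_stories(sample)
print(len(stories))
```

Applied to the full validation file, a count obtained this way should be on the order of the 22,000 samples mentioned above, provided the separator assumption holds.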

Usage

After installing the NeMo Curator package, run the following command from the root of the NeMo Curator repository:

LOGURU_LEVEL="ERROR" python tutorials/text/tinystories/main.py

This downloads the validation split of the TinyStories dataset and then runs the data curation pipeline on it.

Setting LOGURU_LEVEL="ERROR" hides log messages below the ERROR level, which keeps console output to a minimum.
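If you prefer to launch the tutorial from Python rather than the shell, the same environment variable can be set for a child process. The sketch below uses only the standard library; the helper names `build_run` and `run_tutorial` are hypothetical, and the script path is simply the one from the command above, so it is assumed to be resolved relative to the current working directory.

```python
import os
import subprocess
import sys


def build_run(script="tutorials/text/tinystories/main.py", level="ERROR"):
    """Return the command and environment for running the tutorial script.

    The environment is a copy of the current one with LOGURU_LEVEL set,
    so the parent process's own environment is left unchanged.
    """
    env = dict(os.environ, LOGURU_LEVEL=level)
    cmd = [sys.executable, script]
    return cmd, env


def run_tutorial():
    """Run the tutorial script and raise if it exits with an error."""
    cmd, env = build_run()
    subprocess.run(cmd, env=env, check=True)
```

Calling `run_tutorial()` is then equivalent to the shell command shown earlier. Passing a different `level` to `build_run`, such as "INFO", restores more verbose output when debugging a pipeline.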