This tutorial shows how to use NeMo Curator's Python API to curate the TinyStories dataset. TinyStories is a dataset of short stories generated by GPT-3.5 and GPT-4, using only words that 3- to 4-year-olds typically understand. The small size of this dataset makes it ideal for creating and validating data curation pipelines on a local machine.
For simplicity, this tutorial uses the validation split of this dataset, which contains around 22,000 samples.
After installing the NeMo Curator package, you can simply run the following command:
LOGURU_LEVEL="ERROR" python tutorials/text/tinystories/main.py
This will download the validation split of the TinyStories dataset and begin the data curation pipeline.
Setting LOGURU_LEVEL="ERROR" hides log messages below the ERROR level, which keeps console output to a minimum and makes the logs easier to read.
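If you would rather start the pipeline from Python than from the shell, the same command can be sketched with the standard library. This is a minimal sketch, not part of NeMo Curator: the helper names build_env, build_command, and run_pipeline are hypothetical, and it assumes you run it from the repository root so that the relative script path from the command above resolves.

```python
import os
import subprocess
import sys

# Script path taken from the command above; assumes the current
# working directory is the repository root.
SCRIPT = "tutorials/text/tinystories/main.py"


def build_env(level="ERROR"):
    """Return a copy of the current environment with LOGURU_LEVEL set.

    Hypothetical helper: copying avoids changing the parent process's
    environment.
    """
    env = dict(os.environ)
    env["LOGURU_LEVEL"] = level
    return env


def build_command(script=SCRIPT):
    """Use the running interpreter so the script sees the same installed packages."""
    return [sys.executable, script]


def run_pipeline(level="ERROR"):
    """Launch the tutorial script; raises CalledProcessError on a non-zero exit."""
    return subprocess.run(build_command(), env=build_env(level), check=True)
```

Calling run_pipeline() is then equivalent to the shell command above, and run_pipeline("INFO") would show more detailed logs while you debug a pipeline step.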