Name	Name	Last commit message	Last commit date
parent directory ..
README.md	README.md
helper.py	helper.py
image_curation_example.py	image_curation_example.py
image_dedup_example.py	image_dedup_example.py

Getting Started with Image Curation

The Python scripts in this directory contain examples for how to run typical image curation workflows with NeMo Curator. In particular:

image_curation_example.py implements a pipeline to read images, generate their CLIP embeddings, score and filter the images by aesthetic and NSFW content, and save the results
image_dedup_example.py implements a pipeline to read images, generate their CLIP embeddings, filter semantic duplicates (i.e., images which look similar in content to other images within the dataset), and save the results
helper.py contains functions for downloading and saving image data used by the above scripts

Note: Run these examples on GPUs for best performance.

Download and preprocess data

Use the following commands to download a sample Parquet of image URLs from MSCOCO dataset and prepare a deduplicated (removing duplicating URLs), truncated subset for quick experiments.

mkdir -p ./example_data/ && wget https://huggingface.co/datasets/ChristophSchuhmann/MS_COCO_2017_URL_TEXT/resolve/main/mscoco.parquet -O ./example_data/mscoco.parquet

import pandas as pd

NUM_URLS = 100_000
urls = pd.read_parquet("./example_data/mscoco.parquet")
deduplicated_urls = urls[~urls["URL"].duplicated()]
truncated_urls = deduplicated_urls[:NUM_URLS]
truncated_urls.to_parquet("./example_data/truncated_100k_mscoco.parquet")

Download and prepare relevant models weights

Users can manually download or provide specific models. If not provided, the models can be auto-downloaded in the first run.

# LAION NSFW detector weights
mkdir -p ./model_weights/laion/clip-autokeras-binary-nsfw && \
wget -qO ./model_weights/laion/clip-autokeras-binary-nsfw/clip_autokeras_binary_nsfw.zip \
  https://github.com/LAION-AI/CLIP-based-NSFW-Detector/files/10250461/clip_autokeras_binary_nsfw.zip && \
unzip -o ./model_weights/laion/clip-autokeras-binary-nsfw/clip_autokeras_binary_nsfw.zip -d ./model_weights/laion/clip-autokeras-binary-nsfw

# OpenAI CLIP ViT-L/14 weights
mkdir -p ./model_weights/openai && \
python -c "from huggingface_hub import snapshot_download; snapshot_download('openai/clip-vit-large-patch14', local_dir='./model_weights/openai/clip-vit-large-patch14', force_download=True)"

# Aesthetic predictor weights
mkdir -p ./model_weights/ttj && \
python -c "from huggingface_hub import snapshot_download; snapshot_download('ttj/sac-logos-ava1-l14-linearMSE', local_dir='./model_weights/ttj/sac-logos-ava1-l14-linearMSE', force_download=True)"

Note: users can use unzip package (can be installed with apt-get update && apt-get install unzip -y) or any other compressing package

Run the scripts

Note on Batch Sizes: The batch sizes used in both workflows below are conservative limits set for typical GPUs with 24-48 GB of VRAM (e.g., RTX 4090, A6000, RTX A5000). You can tune these based on your available GPU memory:

High-memory GPUs (80 GB+) like H100, B200, or A100 80GB: Increase batch sizes for better performance (e.g., --batch-size 500 --embedding-batch-size 500 --aesthetic-batch-size 500 --nsfw-batch-size 500)
Lower-memory GPUs (16 GB or less): Reduce batch sizes further (e.g., --batch-size 16 --embedding-batch-size 16)

Run the image curation pipeline on GPUs (extracting embeddings, NSFW and aesthetics scores, filtering based on thresholds) using the following commands. Users can set --verbose for progress information report during running the pipeline.

python tutorials/image/getting-started/image_curation_example.py \
    --input-parquet ./example_data/truncated_100k_mscoco.parquet \
    --input-wds-dataset-dir ./example_data/truncated_100k_mscoco \
    --output-dataset-dir ./example_data/results_truncated_100k_mscoco \
    --model-dir ./model_weights \
    --tar-files-per-partition 10 \
    --batch-size 32 \
    --embedding-batch-size 32 \
    --aesthetic-batch-size 32 \
    --nsfw-batch-size 32 \
    --aesthetic-threshold 0.9 \
    --nsfw-threshold 0.9 \
    --images-per-tar 1000

Run the image deduplication pipeline on GPUs (extracting embeddings, running semantic deduplication, removing duplicated samples):

python tutorials/image/getting-started/image_dedup_example.py \
    --input-wds-dataset-dir ./example_data/truncated_100k_mscoco \
    --output-dataset-dir ./example_data/results_truncated_100k_mscoco \
    --embeddings-dir ./example_data/dedup/embeddings/truncated_100k_mscoco \
    --removal-parquets-dir ./example_data/dedup/removal_ids/truncated_100k_mscoco \
    --model-dir ./model_weights \
    --batch-size 32 \
    --embedding-batch-size 32 \
    --tar-files-per-partition 10 \
    --skip-download

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

README.md

Getting Started with Image Curation

Download and preprocess data

Download and prepare relevant models weights

Run the scripts

FilesExpand file tree

getting-started

Directory actions

More options

Directory actions

More options

Latest commit

History

getting-started

Folders and files

parent directory

README.md

Getting Started with Image Curation

Download and preprocess data

Download and prepare relevant models weights

Run the scripts