jump-image-datasets

jump-image-datasets provides packaged JUMP pilot metadata and utilities for downloading image files from metadata tables.

Install

Install from PyPI

pip install jump-image-datasets

Install from PyPI for stable, versioned releases.

Local development with uv

uv venv
uv sync --group test

Editable install

uv pip install -e .

Install from the GitHub repo with pip

pip install "git+https://github.com/WayScience/jump_image_data_downloader.git"

Install from GitHub if you want the latest unreleased changes.

Usage

from jump_image_datasets.jump_pilot import image_downloader, image_metadata

# Load packaged metadata parquet as a DataFrame.
metadata_df = image_metadata.load_metadata()

# Download a small subset.
summary = image_downloader.download_images_with_metadata(
    df=metadata_df.head(10),
    url_column="Metadata_FileUrl",
    default_output_dir="downloaded_jump_pilot_images",
    parallel=True,
    workers=8,
)
print(summary)

For a full runnable example, see docs/download_images_examples.ipynb.
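Because the downloader takes an ordinary DataFrame, the rows to fetch can be narrowed with standard pandas filtering first. The sketch below is illustrative only: it uses a small toy table rather than the packaged parquet, `select_images` is a hypothetical helper (not part of the package), and it assumes the `Metadata_ChannelName` and `source_plate` columns described under "Transform summary" are present in the loaded table.

```python
import pandas as pd

# Toy stand-in for the packaged table; real rows would come from
# image_metadata.load_metadata(). Plate IDs and URLs here are made up.
toy_df = pd.DataFrame(
    {
        "source_plate": ["PLATE_A", "PLATE_A", "PLATE_B", "PLATE_B"],
        "Metadata_ChannelName": ["DNA", "ER", "DNA", "Mito"],
        "Metadata_FileUrl": [
            "s3://example-bucket/PLATE_A/site1_dna.tiff",
            "s3://example-bucket/PLATE_A/site1_er.tiff",
            "s3://example-bucket/PLATE_B/site1_dna.tiff",
            "s3://example-bucket/PLATE_B/site1_mito.tiff",
        ],
    }
)


def select_images(df, channels, plates=None):
    """Hypothetical helper: keep rows for the given channels (and plates)."""
    mask = df["Metadata_ChannelName"].isin(channels)
    if plates is not None:
        mask &= df["source_plate"].isin(plates)
    return df.loc[mask].reset_index(drop=True)


subset = select_images(toy_df, channels=["DNA"], plates=["PLATE_A"])
print(subset)
```

A filtered frame like `subset` could then be passed as `df=` to `download_images_with_metadata` in place of `metadata_df.head(10)`.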

Packaged metadata provenance

This repository ships a packaged metadata table at:

  • src/jump_image_datasets/jump_pilot/data/2020_11_04_CPJUMP1_all_plates.parquet

Why this file exists

The file is included so users can immediately load a stable JUMP pilot metadata table (via jump_image_datasets.jump_pilot.image_metadata) without requiring a separate data-fetch or preprocessing step.

How it was created

This parquet was generated from the JUMP Cell Painting Gallery using the notebook 2.download_image_metadata.ipynb.

Upstream source pattern used by that notebook:

  • s3://cellpainting-gallery/cpg0000-jump-pilot/source_4/workspace/load_data_csv/2020_11_04_CPJUMP1/*/load_data.csv
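In this pattern, the wildcard segment is the per-plate directory, so a plate ID can be recovered from any matching path by taking the name of the directory that contains load_data.csv. The following is a minimal sketch of that idea using only the standard library. `plate_from_path` is a hypothetical helper, and the plate ID in the example path is illustrative rather than taken from the captured run.

```python
import posixpath

# Prefix copied from the source pattern above.
RUN_PREFIX = (
    "s3://cellpainting-gallery/cpg0000-jump-pilot/source_4/"
    "workspace/load_data_csv/2020_11_04_CPJUMP1"
)


def plate_from_path(s3_path: str) -> str:
    """Return the name of the directory that directly contains load_data.csv."""
    return posixpath.basename(posixpath.dirname(s3_path))


# Illustrative plate ID; substitute any path that matches the pattern.
example_path = f"{RUN_PREFIX}/BR00116991/load_data.csv"
print(plate_from_path(example_path))
```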

Transform summary

The generation workflow in 2.download_image_metadata.ipynb:

  • Lists all per-plate load_data.csv files for run 2020_11_04_CPJUMP1 (51 files in the captured run) from public S3 (anon=True).
  • Reads each plate CSV, appends provenance columns:
    • source_plate (plate ID parsed from path)
    • source_s3_path (full S3 CSV path)
  • Concatenates all plate tables into one DataFrame.
  • Reshapes channel URL columns from wide to long using melt:
    • URL columns become Metadata_ChannelURLName
    • URL values become Metadata_FileUrl
  • Adds normalized channel/stain annotations by mapping URL column names:
    • Metadata_ChannelName: ER, AGP, Mito, DNA, RNA, BF, HZ_BF, LZ_BF
    • Metadata_StainName: corresponding stain labels (or NA for brightfield channels)
  • Derives Metadata_Filename from the final path component of Metadata_FileUrl.
  • Writes parquet with index=False as data/2020_11_04_CPJUMP1_all_plates.parquet (captured shape: (1495400, 32)).
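The wide-to-long reshape described above can be sketched on a toy table. This is not the notebook's code: the input column names (`Metadata_Well`, `URL_OrigDNA`, `URL_OrigER`), the URLs, and the two-entry channel mapping are assumptions chosen for illustration, and the stain mapping is omitted because it would follow the same `map` pattern.

```python
import posixpath

import pandas as pd

# Hypothetical wide-format rows: one row per site, one URL column per channel.
wide = pd.DataFrame(
    {
        "Metadata_Well": ["A01", "A02"],
        "URL_OrigDNA": [
            "s3://example-bucket/PLATE_A/a01_dna.tiff",
            "s3://example-bucket/PLATE_A/a02_dna.tiff",
        ],
        "URL_OrigER": [
            "s3://example-bucket/PLATE_A/a01_er.tiff",
            "s3://example-bucket/PLATE_A/a02_er.tiff",
        ],
    }
)

# Provenance columns appended per plate CSV (values are made up here).
wide["source_plate"] = "PLATE_A"
wide["source_s3_path"] = "s3://example-bucket/PLATE_A/load_data.csv"

# Assumed mapping from URL column name to normalized channel name.
CHANNEL_BY_URL_COLUMN = {"URL_OrigDNA": "DNA", "URL_OrigER": "ER"}
url_columns = list(CHANNEL_BY_URL_COLUMN)

# Wide to long: each URL column becomes its own set of rows.
long = wide.melt(
    id_vars=[c for c in wide.columns if c not in url_columns],
    value_vars=url_columns,
    var_name="Metadata_ChannelURLName",
    value_name="Metadata_FileUrl",
)

# Normalized channel annotation and filename taken from the last path component.
long["Metadata_ChannelName"] = long["Metadata_ChannelURLName"].map(CHANNEL_BY_URL_COLUMN)
long["Metadata_Filename"] = long["Metadata_FileUrl"].map(posixpath.basename)

print(long[["Metadata_ChannelName", "Metadata_Filename"]])
```

Each input row yields one output row per URL column, which is why the long table has many more rows than the concatenated plate tables.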

About

Allows JUMP users to download JUMP images by filtering JUMP image metadata.
