diff --git a/cosmos_curate/.gitignore b/cosmos_curate/.gitignore
new file mode 100644
index 0000000..3a3397f
--- /dev/null
+++ b/cosmos_curate/.gitignore
@@ -0,0 +1 @@
+cosmos-curate
diff --git a/cosmos_curate/README.md b/cosmos_curate/README.md
new file mode 100644
index 0000000..9eebb1c
--- /dev/null
+++ b/cosmos_curate/README.md
@@ -0,0 +1,150 @@
# Cosmos Curate

This repository is an example of running NVIDIA [cosmos-curate](https://github.com/nvidia-cosmos/cosmos-curate) pipelines on Anyscale. Examples include the Hello World and Reference Video pipelines.

## Prerequisites

- An [Anyscale account](https://console.anyscale.com/) with the `anyscale` CLI installed (`pip install anyscale`)
- An AWS account with ECR access (for pushing the Docker image) and S3 permissions on the nodes
- A local clone (or symlink, e.g. `ln -sf /path/to/cosmos-curate ./cosmos-curate`) of the [`cosmos-curate`](https://github.com/NVIDIA/cosmos-curate) repo (see [Runtime Environment](#3-runtime-environment) for details)

Your directory layout should look like:
```
cosmos_curate/                # this directory
├── cosmos-curate/            # clone of the cosmos-curate repo
├── docker/
├── hello_world.yaml
├── reference_pipeline.yaml
├── all_nodes_init_script.py
├── cosmos_curate_tokens.yaml
└── ...
```

## Setup

### 1. Docker image

Update the ECR information in `./push_anyscale.sh` with your own repo, then:
```
TAG=1
./build_anyscale.sh $TAG && ./push_anyscale.sh $TAG
```

The `anyscale-cosmos-curate.Dockerfile` adds the [Anyscale requirements](https://docs.anyscale.com/container-image/image-requirement) before building the `pixi` layers, since `chown`'ing those layers afterwards almost doubles the image size. The image was produced by running `./generate_dockerfile.sh` from the `cosmos-curate` repo to generate `cosmos-curate.Dockerfile` without the `cuml` env, then appending the Anyscale portion to that generated Dockerfile.

Once your image is built and pushed, update the jobs' `image_uri:` to point at it.

### 2. 
cosmos_curate.yaml (API auth)

`cosmos-curate` expects `/cosmos_curate/config/cosmos_curate.yaml` to control authentication to APIs and model registries. `huggingface` is all that is required to run the two examples in this repo. Add your credentials locally; when the job runs, the `entrypoint:` distributes the file to all nodes at that path via `all_nodes_init_script.py`.

```
μ cat cosmos_curate_tokens.yaml
huggingface:
  user: ""
  api_key: ""
```

### 3. s3_creds_file.yaml (S3 auth)

`cosmos-curate` expects an S3 credentials file at `/dev/shm/s3_creds_file` (configurable via `COSMOS_S3_PROFILE_PATH`). For this example the jobs run on AWS, where the node IAM role has S3 permissions, so we use the `aws` CLI to write temporary credentials for the job to this path.

If you need to authenticate in a different way, ensure this file is written and distributed to all nodes at the expected filepath.

## Run

The Hello World pipeline runs in a few minutes and only requires 1 T4 GPU node.

```
anyscale job submit -f hello_world.yaml
```

The Reference Video Pipeline takes ~45m with the default setup of 4 L40S GPUs on ~3h of video.
```
anyscale job submit -f reference_pipeline.yaml
```

## How It Works

### Cosmos Curate on Anyscale

Let's break down `reference_pipeline.yaml` to get a sense for how the setup comes together, starting from the hardware we want to use and ending at the user code defining the pipeline.

### 1. Compute Config

This defines the nodes required to run the pipeline. Typically the head node in a Ray cluster should be set to zero logical resources, but the `cosmos-curate` library expects schedulable resources on the head node.
```
compute_config:
  head_node:
    instance_type: m5.2xlarge
    resources:
      CPU: 8
      GPU: 0
    flags: {}
  worker_nodes:
  - instance_type: g6e.4xlarge
    flags: {}
    min_nodes: 4
    max_nodes: 4
    market_type: ON_DEMAND
```

The reference video pipeline defaults to 4 1xL40S instances. 
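The GPU count is controlled entirely by this compute config. As a sketch, a hypothetical 16-GPU variant of the same block just bumps the worker node counts (assuming `g6e.4xlarge` stays at 1x L40S per node and your account has quota for 16 of them):

```yaml
compute_config:
  head_node:
    instance_type: m5.2xlarge
    resources:
      CPU: 8
      GPU: 0
    flags: {}
  worker_nodes:
  - instance_type: g6e.4xlarge   # 1x L40S each
    flags: {}
    min_nodes: 16                # 16 nodes x 1 GPU = 16 GPUs
    max_nodes: 16
    market_type: ON_DEMAND
```

Setting `min_nodes` equal to `max_nodes` keeps the cluster at a fixed size for the duration of the job rather than autoscaling.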
The logs at the end of the pipeline report runtimes. Here is 4 GPUs compared to 16 GPUs for ~1k videos, about 3h of video:

4 GPUs took 44m
```
2026-02-28 19:12:46.030 | INFO | cosmos_curate.pipelines.video.splitting_pipeline:split:703 - Split-Transcode-Filter-Annotate pipeline: input_build_time=0.01 / pipeline_run_time=44.26 / summary_run_time=0.02 mins processing time for total_video_length=3.191 hours of raw videos
```

16 GPUs took 13m
```
2026-03-01 05:56:58.599 | INFO | cosmos_curate.pipelines.video.splitting_pipeline:split:703 - Split-Transcode-Filter-Annotate pipeline: input_build_time=0.01 / pipeline_run_time=12.71 / summary_run_time=0.01 mins processing time for total_video_length=3.191 hours of raw videos
```

### 2. Image

This block sets the image all the nodes will start from and, since it is a custom-built image, declares the Ray version we will be running.
```
image_uri: 367974485317.dkr.ecr.us-west-2.amazonaws.com/anyscale-cosmos-curate:6
ray_version: 2.48.0
```

When the job runs it acquires all the nodes and uses our image, which handles a few things for us:
* All `pixi` environments are built into the container.
* On Anyscale we typically just use `working_dir` or `py_modules` to ship code for use at runtime, but `cosmos-curate` expects code at `/opt/cosmos-curate/cosmos_curate`, so a copy of the code lives there as well for referencing the `all_models.json` file and some other configuration.
* We set `PIXI_PROJECT_MANIFEST` in the image so that runtime `pixi run` calls (whether from the `entrypoint:` or from a pipeline model class's `py_executable`, which switches between envs for specific models) all know where these environments are built and cached. The `default` `pixi` environment provides the default `python` on `PATH` if you call `python` directly outside of `pixi run`.

### 3. 
Runtime Environment

Anyscale ships your `working_dir`, which should be the `examples/cosmos-curate/` directory. This gives the nodes access to the files used to set them up, additional Python scripts to run, Python packages, and so on, and lets us update the code running on the image without requiring an image rebuild.

`py_modules` packages a local clone of the `cosmos-curate` repo (the `./cosmos-curate` directory listed in [Prerequisites](#prerequisites)) and ships it to all nodes at runtime. This lets you iterate on `cosmos-curate` source code without rebuilding the Docker image, overriding the copy baked into the image at `/opt/cosmos-curate/cosmos_curate/`.

```
py_modules: ["./cosmos-curate"]
working_dir: "."
```

### 4. Entrypoint

The `entrypoint:` is executed on the head node only. Typically this might be as simple as `entrypoint: python main.py`, but for `cosmos-curate` we want to coordinate some startup logic, so we use Ray to distribute initialization work to every node before executing the main entrypoint from the `cosmos_curate` library.

```
entrypoint: >
  python all_nodes_init_script.py qwen2.5_vl,transnetv2,internvideo2_mm,bert
  && pixi run python -m cosmos_curate.pipelines.video.run_pipeline split
  --input-video-path "s3://ray-example-data/videos/Hollywood2-actions-videos/Hollywood2/AVIClips/"
  --output-clip-path "/mnt/user_storage/output_clips/"
```

#### python all_nodes_init_script.py

The `all_nodes_init_script.py` handles a few initialization steps for the cluster:

1. Use `write_s3_creds_file.sh` to put an S3 credentials file where it is expected on each node.
2. Copy our local `cosmos_curate_tokens.yaml` to the expected location on each node for API and model registry auth.
3. 
Use the `model-download` `pixi` env to run `python -m cosmos_curate.core.managers.model_cli download`, passing the list of models needed for the pipeline we are going to run (if you do not specify the models it downloads all of them, which takes a while and 500GB+ of space).

#### python -m cosmos_curate.pipelines.video.run_pipeline split

Now the actual pipeline uses the default `pixi` env to run `python -m cosmos_curate.pipelines.video.run_pipeline split`. There are many CLI options you can pass to the pipelines `cosmos-curate` provides, but here we just set the minimal input and output paths and accept the defaults for the rest.
diff --git a/cosmos_curate/all_nodes_init_script.py b/cosmos_curate/all_nodes_init_script.py
new file mode 100644
index 0000000..412810e
--- /dev/null
+++ b/cosmos_curate/all_nodes_init_script.py
@@ -0,0 +1,37 @@
import sys
import ray
import subprocess
from time import perf_counter as pc

SCRIPT = """
set -e
bash write_s3_creds_file.sh
cp cosmos_curate_tokens.yaml /cosmos_curate/config/cosmos_curate.yaml
pixi run -e model-download python -m cosmos_curate.core.managers.model_cli download --models {models}
"""

@ray.remote(num_cpus=0)
def run_init(script):
    try:
        subprocess.check_call(script, shell=True, stderr=subprocess.STDOUT)
    except subprocess.CalledProcessError as e:
        raise RuntimeError(f"Init script failed (exit code {e.returncode})") from None

if __name__ == "__main__":
    models = sys.argv[1]
    script = SCRIPT.format(models=models)
    t = pc()
    ray.init(address="auto")
    nodes = [n for n in ray.nodes() if n["Alive"]]
    tasks = [
        run_init.options(
            scheduling_strategy=ray.util.scheduling_strategies.NodeAffinitySchedulingStrategy(
                node_id=n["NodeID"], soft=False
            )
        ).remote(script)
        for n in nodes
    ]
    print(f"Downloading models on {len(tasks)} nodes...")
    ray.get(tasks)
    dur = pc() - t
    print(f"Done. 
({dur:0.1f}s)") diff --git a/cosmos_curate/cosmos_curate_tokens.yaml b/cosmos_curate/cosmos_curate_tokens.yaml new file mode 100644 index 0000000..808febe --- /dev/null +++ b/cosmos_curate/cosmos_curate_tokens.yaml @@ -0,0 +1,4 @@ +huggingface: + user: "" + api_key: "" + diff --git a/cosmos_curate/docker/anyscale-cosmos-curate.Dockerfile b/cosmos_curate/docker/anyscale-cosmos-curate.Dockerfile new file mode 100644 index 0000000..9c51c54 --- /dev/null +++ b/cosmos_curate/docker/anyscale-cosmos-curate.Dockerfile @@ -0,0 +1,225 @@ +# Dockerfile template for cosmos-curate +# +# The dockerfile is templated so that we can provide different conda env information. +# Docs on docker best practices: +# - https://linuxhandbook.com/dockerize-python-apps/ +# - https://uwekorn.com/2021/03/01/deploying-conda-environments-in-docker-how-to-do-it-right.html +# - https://cloud.google.com/architecture/best-practices-for-building-containers + +ARG DEBIAN_FRONTEND=noninteractive + +FROM nvcr.io/nvidia/cuda:12.9.1-devel-ubuntu24.04 AS main + +SHELL ["/bin/bash", "-c"] +ENV NVIDIA_DRIVER_CAPABILITIES=compute,video,utility +ENV TZ=America/Los_Angeles +# Get system level packages +RUN apt-get update \ + && apt-get install -y \ + # Needed for opencv + libsm6 libxext6 \ + # Needed because the certs age out sometimes? 
+ ca-certificates \ + # Needed for installing pixi \ + wget \ + # Needed for pip install \ + git \ + # Needed for cuda profiling \ + nsight-systems-2025.3.2 \ + --option=Dpkg::Options::=--force-confdef \ + # Needed to copy model weights using rsync + rsync \ + && update-ca-certificates \ + && apt-get clean \ + && rm -rf /var/lib/apt/lists/* + +# GPU-accelerated ffmpeg (also needed for opencv) +ENV FFMPEG_VERSION=7.0.1 \ + NVCODEC_VERSION=12.1.14.0 +RUN mkdir -p /tmp && chmod 1777 /tmp && \ + apt-get update && \ + apt-get install -y \ + libcrypt-dev \ + autoconf \ + automake \ + build-essential \ + cmake \ + libaom-dev \ + libass-dev \ + libdav1d-dev \ + libdrm-dev \ + libfreetype6-dev \ + libgnutls28-dev \ + libnuma-dev \ + libopenh264-dev \ + libtool \ + libva-dev \ + libvorbis-dev \ + libvpx-dev \ + libwebp-dev \ + pkg-config \ + texinfo \ + vainfo \ + yasm \ + zlib1g-dev && \ + wget -O /tmp/nv-codec-headers.tar.gz https://github.com/FFmpeg/nv-codec-headers/releases/download/n${NVCODEC_VERSION}/nv-codec-headers-${NVCODEC_VERSION}.tar.gz && \ + tar xzvf /tmp/nv-codec-headers.tar.gz -C /tmp/ && \ + cd /tmp/nv-codec-headers-${NVCODEC_VERSION} && \ + make && \ + make install && \ + wget -O /tmp/ffmpeg-snapshot.tar.bz2 https://www.ffmpeg.org/releases/ffmpeg-${FFMPEG_VERSION}.tar.bz2 && \ + tar xjvf /tmp/ffmpeg-snapshot.tar.bz2 -C /tmp/ && \ + cd /tmp/ffmpeg-${FFMPEG_VERSION} && \ + PATH="/usr/local/cuda/bin:$PATH" \ + ./configure \ + --prefix=/usr/local \ + --enable-nonfree \ + --enable-cuda-nvcc \ + --enable-libnpp \ + --enable-libopenh264 \ + --enable-libaom \ + --enable-libdav1d \ + --enable-libvorbis \ + --enable-libvpx \ + --enable-libwebp \ + --enable-vaapi \ + --extra-cflags=-I/usr/local/cuda/include \ + --extra-ldflags=-L/usr/local/cuda/lib64 \ + --extra-libs=-lpthread \ + --extra-libs=-lm \ + --disable-static \ + --enable-shared \ + --disable-doc \ + --disable-debug && \ + make -j$(nproc) && \ + make install && \ + ldconfig && \ + # Clean up + cd / && \ + rm 
-rf /tmp/ffmpeg* && \ + rm -rf /tmp/nv-codec-headers* && \ + apt-get clean && rm -rf /var/lib/apt/lists/* + +# Install pixi +RUN wget -qO- https://pixi.sh/install.sh | PIXI_HOME=/usr/local PIXI_NO_PATH_UPDATE=1 sh + +# Common ENV variables needed by some ML libs +ENV AM_I_DOCKER=True \ + BUILD_WITH_CUDA=True \ + TORCH_CUDA_ARCH_LIST="8.0;8.6;9.0;10.0+PTX" \ + CUDA_HOME="/usr/local/cuda" \ + XFORMERS_IGNORE_FLASH_VERSION_CHECK="1" \ + VLLM_WORKER_MULTIPROC_METHOD="spawn" \ + VLLM_USE_V1="1" + +# Disable Ray log dedup +ENV RAY_DEDUP_LOGS=0 \ + RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES=1 \ + RAY_MAX_LIMIT_FROM_API_SERVER=40000 \ + RAY_MAX_LIMIT_FROM_DATA_SOURCE=40000 \ + RAY_DEFAULT_OBJECT_STORE_MAX_MEMORY_BYTES=800000000000 \ + RAY_DEFAULT_OBJECT_STORE_MEMORY_PROPORTION=0.4 \ + RAY_gcs_rpc_server_connect_timeout_s=30 \ + RAY_gcs_rpc_server_reconnect_timeout_s=180 \ + RAY_WARN_BLOCKING_GET_INSIDE_ASYNC=0 \ + XENNA_RAY_METRICS_PORT=9002 + +# boto3 & pbss +ENV AWS_REQUEST_CHECKSUM_CALCULATION='when_required' + +# Set a bunch of env vars so that we cache weights in a workspace +ENV DEFAULT_WORKSPACE_LOC="/config/default_workspace" +ENV HF_HOME="${DEFAULT_WORKSPACE_LOC}/weights/hf_home/" \ + LAION_CACHE_HOME="${DEFAULT_WORKSPACE_LOC}/weights/laion_cache/" + +# Set up pixi environments +COPY pixi.toml pixi.lock /opt/cosmos-curate/ + + +# ========================================================================== +# Anyscale compatibility layer +# Ref: https://docs.anyscale.com/container-image/image-requirement.md +# +# Everything above runs as root. Everything below runs as ray. Critically the pixi envs need to be ray owned. 
+# ========================================================================== + +# Anyscale system packages +RUN set -euxo pipefail \ + && apt-get update -y \ + && apt-get install -y --no-install-recommends \ + sudo \ + tzdata \ + openssh-client \ + openssh-server \ + zip \ + unzip \ + gdb \ + curl \ + vim \ + && apt-get clean \ + && rm -rf /var/lib/apt/lists/* \ + && mkdir -p /var/run/sshd + +# Rename ubuntu (uid 1000) -> ray and align with Anyscale requirements +# (uid 1000, gid 100, passwordless sudo). +RUN set -euxo pipefail \ + && groupmod -n ray users \ + && usermod -l ray -d /home/ray -m ubuntu \ + && usermod -u 1000 -g 100 ray \ + && usermod -aG sudo ray \ + && echo 'ray ALL=NOPASSWD: ALL' >> /etc/sudoers \ + && chown -R ray:ray /home/ray \ + && chown ray:ray /opt/cosmos-curate + +USER ray + +# ---------- pixi environments (owned by ray, no chown needed) ---------- +# If we install all the environments in a single layer, it's over 20GB and will cause slurm/NVCF to timeout pulling the +# layer. Since the cuml environment is large and needs non-overlapping RAPIDS packages, we install it separately. +RUN cd /opt/cosmos-curate && \ + export CONDA_OVERRIDE_CUDA=12.9.1 && \ + pixi install -e default -e legacy-transformers -e model-download -e transformers -e unified --frozen && \ + pixi clean cache -y + +# ---------- Anyscale Python packages ---------- +# This also validates ray has write access to the pixi environments. 
+RUN set -euxo pipefail \ + && cd /opt/cosmos-curate \ + && pixi run -e default pip install --no-cache-dir \ + anyscale \ + packaging \ + boto3 \ + google \ + google-cloud-storage \ + terminado \ + && pixi run -e default pip install --no-cache-dir jupyterlab + +RUN cd /tmp \ + && curl -fsSL "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o awscliv2.zip \ + && unzip -q awscliv2.zip \ + && sudo ./aws/install \ + && rm -rf aws awscliv2.zip + +# ---------- cosmos-curate source code ---------- +COPY --chown=ray:ray cosmos_curate /opt/cosmos-curate/cosmos_curate +COPY --chown=ray:ray tests /opt/cosmos-curate/tests +COPY --chown=ray:ray pytest.ini .coveragerc /opt/cosmos-curate/ + +# Workspace shell setup (Anyscale workspace requirement). +RUN set -euxo pipefail \ + && echo 'PROMPT_COMMAND="history -a"' >> /home/ray/.bashrc \ + && echo '[ -e ~/.workspacerc ] && source ~/.workspacerc' >> /home/ray/.bashrc + +RUN sudo mkdir -p /cosmos_curate/config /config /anyscale/init \ + && sudo chown -R ray:ray /cosmos_curate /config /anyscale/init + +# Model registry needed by cosmos-curate at import time +COPY cosmos_curate/configs/all_models.json /opt/cosmos-curate/cosmos_curate/configs/all_models.json + +ENV PATH=/opt/cosmos-curate/.pixi/envs/default/bin:$PATH \ + HOME=/home/ray \ + PIXI_PROJECT_MANIFEST=/opt/cosmos-curate/pixi.toml +WORKDIR /home/ray + +ENTRYPOINT [] +CMD ["bash"] diff --git a/cosmos_curate/docker/build_anyscale.sh b/cosmos_curate/docker/build_anyscale.sh new file mode 100755 index 0000000..3b10ee9 --- /dev/null +++ b/cosmos_curate/docker/build_anyscale.sh @@ -0,0 +1,12 @@ +TAG=${1:-1} +REPO_ROOT=$HOME/git/cosmos-curate +IMAGE=anyscale-cosmos-curate + +docker build \ + --ulimit nofile=65536 \ + --progress=auto \ + --network=host \ + -f anyscale-cosmos-curate.Dockerfile \ + -t ${IMAGE}:$TAG \ + -t ${IMAGE}:latest \ + $REPO_ROOT diff --git a/cosmos_curate/docker/build_cosmos.sh b/cosmos_curate/docker/build_cosmos.sh new file mode 100755 index 
0000000..bf7a9b6 --- /dev/null +++ b/cosmos_curate/docker/build_cosmos.sh @@ -0,0 +1,9 @@ +TAG=1 +REPO_ROOT=$HOME/git/cosmos-curate +docker build \ + --ulimit nofile=65536 \ + --network=host \ + -f cosmos-curate.Dockerfile \ + -t cosmos-curate:$TAG \ + -t cosmos-curate:latest \ + $REPO_ROOT diff --git a/cosmos_curate/docker/cosmos-curate.Dockerfile b/cosmos_curate/docker/cosmos-curate.Dockerfile new file mode 100644 index 0000000..779b449 --- /dev/null +++ b/cosmos_curate/docker/cosmos-curate.Dockerfile @@ -0,0 +1,171 @@ +# Dockerfile template for cosmos-curate +# +# The dockerfile is templated so that we can provide different conda env information. +# Docs on docker best practices: +# - https://linuxhandbook.com/dockerize-python-apps/ +# - https://uwekorn.com/2021/03/01/deploying-conda-environments-in-docker-how-to-do-it-right.html +# - https://cloud.google.com/architecture/best-practices-for-building-containers + +ARG DEBIAN_FRONTEND=noninteractive + +FROM nvcr.io/nvidia/cuda:12.9.1-devel-ubuntu24.04 AS main + +SHELL ["/bin/bash", "-c"] +ENV NVIDIA_DRIVER_CAPABILITIES=compute,video,utility +ENV TZ=America/Los_Angeles +# Get system level packages +RUN apt-get update \ + && apt-get install -y \ + # Needed for opencv + libsm6 libxext6 \ + # Needed because the certs age out sometimes? 
+ ca-certificates \ + # Needed for installing pixi \ + wget \ + # Needed for pip install \ + git \ + # Needed for cuda profiling \ + nsight-systems-2025.3.2 \ + --option=Dpkg::Options::=--force-confdef \ + # Needed to copy model weights using rsync + rsync \ + && update-ca-certificates \ + && apt-get clean \ + && rm -rf /var/lib/apt/lists/* + +# GPU-accelerated ffmpeg (also needed for opencv) +ENV FFMPEG_VERSION=7.0.1 \ + NVCODEC_VERSION=12.1.14.0 +RUN mkdir -p /tmp && chmod 1777 /tmp && \ + apt-get update && \ + apt-get install -y \ + libcrypt-dev \ + autoconf \ + automake \ + build-essential \ + cmake \ + libaom-dev \ + libass-dev \ + libdav1d-dev \ + libdrm-dev \ + libfreetype6-dev \ + libgnutls28-dev \ + libnuma-dev \ + libopenh264-dev \ + libtool \ + libva-dev \ + libvorbis-dev \ + libvpx-dev \ + libwebp-dev \ + pkg-config \ + texinfo \ + vainfo \ + yasm \ + zlib1g-dev && \ + wget -O /tmp/nv-codec-headers.tar.gz https://github.com/FFmpeg/nv-codec-headers/releases/download/n${NVCODEC_VERSION}/nv-codec-headers-${NVCODEC_VERSION}.tar.gz && \ + tar xzvf /tmp/nv-codec-headers.tar.gz -C /tmp/ && \ + cd /tmp/nv-codec-headers-${NVCODEC_VERSION} && \ + make && \ + make install && \ + wget -O /tmp/ffmpeg-snapshot.tar.bz2 https://www.ffmpeg.org/releases/ffmpeg-${FFMPEG_VERSION}.tar.bz2 && \ + tar xjvf /tmp/ffmpeg-snapshot.tar.bz2 -C /tmp/ && \ + cd /tmp/ffmpeg-${FFMPEG_VERSION} && \ + PATH="/usr/local/cuda/bin:$PATH" \ + ./configure \ + --prefix=/usr/local \ + --enable-nonfree \ + --enable-cuda-nvcc \ + --enable-libnpp \ + --enable-libopenh264 \ + --enable-libaom \ + --enable-libdav1d \ + --enable-libvorbis \ + --enable-libvpx \ + --enable-libwebp \ + --enable-vaapi \ + --extra-cflags=-I/usr/local/cuda/include \ + --extra-ldflags=-L/usr/local/cuda/lib64 \ + --extra-libs=-lpthread \ + --extra-libs=-lm \ + --disable-static \ + --enable-shared \ + --disable-doc \ + --disable-debug && \ + make -j$(nproc) && \ + make install && \ + ldconfig && \ + # Clean up + cd / && \ + rm 
-rf /tmp/ffmpeg* && \ + rm -rf /tmp/nv-codec-headers* && \ + apt-get clean && rm -rf /var/lib/apt/lists/* + +# Install pixi +RUN wget -qO- https://pixi.sh/install.sh | PIXI_HOME=/usr/local PIXI_NO_PATH_UPDATE=1 sh + +# Common ENV variables needed by some ML libs +ENV AM_I_DOCKER=True \ + BUILD_WITH_CUDA=True \ + TORCH_CUDA_ARCH_LIST="8.0;8.6;9.0;10.0+PTX" \ + CUDA_HOME="/usr/local/cuda" \ + XFORMERS_IGNORE_FLASH_VERSION_CHECK="1" \ + VLLM_WORKER_MULTIPROC_METHOD="spawn" \ + VLLM_USE_V1="1" + +# Disable Ray log dedup +ENV RAY_DEDUP_LOGS=0 \ + RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES=1 \ + RAY_MAX_LIMIT_FROM_API_SERVER=40000 \ + RAY_MAX_LIMIT_FROM_DATA_SOURCE=40000 \ + RAY_DEFAULT_OBJECT_STORE_MAX_MEMORY_BYTES=800000000000 \ + RAY_DEFAULT_OBJECT_STORE_MEMORY_PROPORTION=0.4 \ + RAY_gcs_rpc_server_connect_timeout_s=30 \ + RAY_gcs_rpc_server_reconnect_timeout_s=180 \ + RAY_WARN_BLOCKING_GET_INSIDE_ASYNC=0 \ + XENNA_RAY_METRICS_PORT=9002 + +# boto3 & pbss +ENV AWS_REQUEST_CHECKSUM_CALCULATION='when_required' + +# Set a bunch of env vars so that we cache weights in a workspace +ENV DEFAULT_WORKSPACE_LOC="/config/default_workspace" +ENV HF_HOME="${DEFAULT_WORKSPACE_LOC}/weights/hf_home/" \ + LAION_CACHE_HOME="${DEFAULT_WORKSPACE_LOC}/weights/laion_cache/" + +# Set up pixi environments +COPY pixi.toml pixi.lock /opt/cosmos-curate/ +# If we install all the environments in a single layer, it's over 20GB and will cause slurm/NVCF to timeout pulling the +# layer. Since the cuml environment is large and needs non-overlapping RAPIDS packages, we install it separately. +RUN cd /opt/cosmos-curate && \ + export CONDA_OVERRIDE_CUDA=12.9.1 && \ + pixi install -e default -e legacy-transformers -e model-download -e transformers -e unified --frozen && \ + pixi clean cache -y + +# Install the cuml environment separately if requested. 
+ + +# Run any hacky post-install script for each environment +COPY package/cosmos_curate/envs/ /tmp/cosmos_curate_build_envs + + +# For cosmos-xenna development +# For every environment, uninstall cosmos-xenna and then reinstall from local build. + + +# Copy the video pipeline code +COPY cosmos_curate /opt/cosmos-curate/cosmos_curate +COPY tests /opt/cosmos-curate/tests +COPY pytest.ini .coveragerc /opt/cosmos-curate/ + +# Copy additional code paths into the container + + +# Debug env vars +# ENV PYTHON_LOG=debug RUST_LOG=debug VLLM_LOG_LEVEL=DEBUG + +# Expose port for FastAPI & Ray +EXPOSE 8000 6379 + +WORKDIR /opt/cosmos-curate + +CMD ["pixi", "run", "python", "cosmos_curate/scripts/onto_nvcf.py", "--helm", "True"] \ No newline at end of file diff --git a/cosmos_curate/docker/generate_dockerfile.sh b/cosmos_curate/docker/generate_dockerfile.sh new file mode 100755 index 0000000..9d0dd42 --- /dev/null +++ b/cosmos_curate/docker/generate_dockerfile.sh @@ -0,0 +1,11 @@ +# drop cuml from default envs built +CWD=$(pwd) +REPO_ROOT=$HOME/git/cosmos-curate +cd $REPO_ROOT +cosmos-curate image build \ + --curator-path "${REPO_ROOT}" \ + --image-name cosmos-curate \ + --image-tag 1 \ + --dry-run \ + --envs legacy-transformers,transformers,unified \ + --dockerfile-output-path "${CWD}/cosmos-curate.Dockerfile" diff --git a/cosmos_curate/docker/push_anyscale.sh b/cosmos_curate/docker/push_anyscale.sh new file mode 100755 index 0000000..847bd90 --- /dev/null +++ b/cosmos_curate/docker/push_anyscale.sh @@ -0,0 +1,22 @@ +TAG=${1:-1} +REGISTRY=${2:-aws} +IMAGE=anyscale-cosmos-curate +SRC=${IMAGE}:${TAG} + +if [ "$REGISTRY" = "aws" ]; then + AWS_ACCOUNT=367974485317 + AWS_REGION=us-west-2 + AWS_REPO=wagner-west-2 + DST_BASE=${AWS_ACCOUNT}.dkr.ecr.${AWS_REGION}.amazonaws.com/${IMAGE} + aws ecr get-login-password --region ${AWS_REGION} | docker login --username AWS --password-stdin ${AWS_ACCOUNT}.dkr.ecr.${AWS_REGION}.amazonaws.com +else + PROJECT_ID=troubleshootingorg-gcp-pub + 
REGION=us-central1 + REPO=wagner-docker + DST_BASE=${REGION}-docker.pkg.dev/${PROJECT_ID}/${REPO}/${IMAGE} +fi + +docker tag ${SRC} ${DST_BASE}:${TAG} +docker push ${DST_BASE}:${TAG} +docker tag ${SRC} ${DST_BASE}:latest +docker push ${DST_BASE}:latest diff --git a/cosmos_curate/hello_world.yaml b/cosmos_curate/hello_world.yaml new file mode 100644 index 0000000..59501f0 --- /dev/null +++ b/cosmos_curate/hello_world.yaml @@ -0,0 +1,22 @@ +name: cosmos-curate-hello-world +image_uri: 367974485317.dkr.ecr.us-west-2.amazonaws.com/anyscale-cosmos-curate:6 +ray_version: 2.48.0 +entrypoint: > + python all_nodes_init_script.py gpt2 + && pixi run python -m cosmos_curate.pipelines.examples.hello_world_pipeline +py_modules: ["./cosmos-curate"] +compute_config: + head_node: + instance_type: m5.2xlarge + resources: + CPU: 8 + GPU: 0 + flags: {} + worker_nodes: + - instance_type: g4dn.xlarge + flags: {} + min_nodes: 1 + max_nodes: 1 + market_type: ON_DEMAND +working_dir: "." +max_retries: 0 diff --git a/cosmos_curate/model_download_worker.py b/cosmos_curate/model_download_worker.py new file mode 100644 index 0000000..6db7ee7 --- /dev/null +++ b/cosmos_curate/model_download_worker.py @@ -0,0 +1,24 @@ +import ray +from ray.util.scheduling_strategies import NodeAffinitySchedulingStrategy +import cosmos_curate + +@ray.remote(runtime_env={"py_executable": "pixi run -e model-download python", "excludes": ["./pixi/"], "env_vars": {"PIXI_PROJECT_MANIFEST": "/opt/cosmos-curate/pixi.toml"}}) +def download_model(): + # Only works from /opt/cosmos-curate. Not importable otherwise from package. 
+ import cosmos_curate.core.managers.model_cli as cli + cli.main(["download", "--models", "qwen2.5_vl,transnetv2,internvideo2_mm,bert"]) + +if __name__ == "__main__": + ray.init(runtime_env={"env_vars": {"PIXI_PROJECT_MANIFEST": "/opt/cosmos-curate/pixi.toml"}, "py_modules": [cosmos_curate]}) + refs = [] + for n in ray.nodes(): + if not n["Alive"]: + continue + ref = ( + download_model + .options(scheduling_strategy=NodeAffinitySchedulingStrategy(node_id=n["NodeID"], soft=False)) + .remote() + ) + refs.append(ref) + ray.get(refs) + diff --git a/cosmos_curate/reference_pipeline.yaml b/cosmos_curate/reference_pipeline.yaml new file mode 100644 index 0000000..b28602a --- /dev/null +++ b/cosmos_curate/reference_pipeline.yaml @@ -0,0 +1,24 @@ +name: cosmos-curate-reference-pipeline +image_uri: 367974485317.dkr.ecr.us-west-2.amazonaws.com/anyscale-cosmos-curate:6 +ray_version: 2.48.0 +entrypoint: > + python all_nodes_init_script.py qwen2.5_vl,transnetv2,internvideo2_mm,bert + && pixi run python -m cosmos_curate.pipelines.video.run_pipeline split + --input-video-path "s3://ray-example-data/videos/Hollywood2-actions-videos/Hollywood2/AVIClips/" + --output-clip-path "/mnt/user_storage/output_clips/" +py_modules: ["./cosmos-curate"] +compute_config: + head_node: + instance_type: m5.2xlarge + resources: + CPU: 8 + GPU: 0 + flags: {} + worker_nodes: + - instance_type: g6e.4xlarge + flags: {} + min_nodes: 4 + max_nodes: 4 + market_type: ON_DEMAND +working_dir: "." +max_retries: 0 diff --git a/cosmos_curate/write_s3_creds_file.sh b/cosmos_curate/write_s3_creds_file.sh new file mode 100755 index 0000000..b0d6285 --- /dev/null +++ b/cosmos_curate/write_s3_creds_file.sh @@ -0,0 +1,7 @@ +eval $(aws configure export-credentials --format env) +cat > /dev/shm/s3_creds_file <