feat: add Experiment.export() to write runnable scripts without submitting#466

Open
ko3n1g wants to merge 7 commits into main from ko3n1g/feat/export-slurm-scripts
Conversation


@ko3n1g ko3n1g commented Mar 16, 2026

Summary

  • Adds Experiment.export(output_dir, exist_ok=False) which writes one script per job to a self-contained directory plus a submit_all.sh launcher — no jobs are submitted
  • Each scheduler's _submit_dryrun() now writes its script to executor.experiment_dir when set:
    • LocalExecutor → <task>.sh (executable bash)
    • DockerExecutor → <task>.yaml
    • SkypilotExecutor / SkypilotJobsExecutor → <task>.yaml
    • LeptonExecutor → <task>.sh (executable bash)
    • DGXCloudExecutor → <task>_torchrun_job.sh (was hardcoded, now uses job_name)
  • Experiment.export() redirects all executor experiment_dirs to output_dir, runs dryrun to trigger script writing, then generates submit_all.sh with the correct submit command per executor type (sbatch, sky launch, etc.)
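The `submit_all.sh` generation described above can be sketched roughly as follows. This is a hypothetical standalone sketch, not the PR's actual implementation: `SUBMIT_COMMANDS` and `write_submit_all` are illustrative names, and the real command mapping lives inside `Experiment.export()`.

```python
import os
from pathlib import Path

# Hypothetical executor-class → submit-command mapping; the real set of
# commands per executor type is defined inside Experiment.export().
SUBMIT_COMMANDS = {
    "SlurmExecutor": "sbatch",
    "SkypilotExecutor": "sky launch",
    "LocalExecutor": "bash",
}


def write_submit_all(output_dir: str, jobs: list) -> Path:
    """Write a submit_all.sh invoking the right submit command per job.

    `jobs` is a list of (script_filename, executor_class_name) pairs.
    """
    path = Path(output_dir) / "submit_all.sh"
    lines = [
        "#!/bin/bash",
        "# Generated by NeMo Run",
        'SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"',
    ]
    for script, executor_cls in jobs:
        cmd = SUBMIT_COMMANDS[executor_cls]
        lines.append(f'{cmd} "$SCRIPT_DIR/{script}"')
    path.write_text("\n".join(lines) + "\n")
    # 0o700 (owner-only) matches the CodeQL-driven tightening later in this PR.
    os.chmod(path, 0o700)
    return path
```

Each line resolves the script path relative to `SCRIPT_DIR`, so the exported directory can be copied anywhere and submitted as a unit.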

Example — SLURM

with run.Experiment("my-exp") as exp:
    exp.add(
        run.Script("scripts/pretrain.sh"),
        executor=SlurmExecutor(account="my_account", partition="gpu", nodes=4, ...),
        name="pretrain",
    )
    exp.add(
        run.Script("scripts/finetune.sh"),
        executor=SlurmExecutor(account="my_account", partition="gpu", nodes=1, ...),
        name="finetune",
    )
    exp.export("/tmp/my_exp_scripts")

Output directory:

/tmp/my_exp_scripts/
├── finetune_sbatch.sh
├── pretrain_sbatch.sh
└── submit_all.sh
pretrain_sbatch.sh
#!/bin/bash
#
# Generated by NeMo Run
# Run with: sbatch --requeue --parsable
#

# Parameters
#SBATCH --account=my_account
#SBATCH --gpus-per-node=8
#SBATCH --job-name=my_account-account.pretrain
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=8
#SBATCH --open-mode=append
#SBATCH --output=/scratch/myuser/nemo_jobs/my-exp/my-exp_.../pretrain/sbatch_my_account-account.pretrain_%j.out
#SBATCH --partition=gpu
#SBATCH --time=04:00:00

set -evx

export PYTHONUNBUFFERED=1
export SLURM_UNBUFFEREDIO=1
export TORCHX_MAX_RETRIES=0

set +e

# setup

nodes=( $( scontrol show hostnames $SLURM_JOB_NODELIST ) )
nodes_array=($nodes)
head_node=${nodes_array[0]}
head_node_ip=$(srun --nodes=1 --ntasks=1 -w "$head_node" hostname --ip-address)

# Command 1

srun \
  --output .../pretrain/log-my_account-account.pretrain_%j_${SLURM_RESTART_COUNT:-0}.out \
  --container-image nvcr.io/nvidia/nemo:latest \
  --container-mounts .../pretrain:/nemo_run \
  --container-workdir /nemo_run/code \
  --wait=60 --kill-on-bad-exit=1 \
  bash scripts/pretrain.sh

exitcode=$?

set -e

echo "job exited with code $exitcode"
if [ $exitcode -ne 0 ]; then
    if [ "$TORCHX_MAX_RETRIES" -gt "${SLURM_RESTART_COUNT:-0}" ]; then
        scontrol requeue "$SLURM_JOB_ID"
    fi
    exit $exitcode
fi
submit_all.sh
#!/bin/bash
# Submit all jobs for experiment: my-exp
# Generated by NeMo Run

SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"

sbatch "$SCRIPT_DIR/pretrain_sbatch.sh"
sbatch "$SCRIPT_DIR/finetune_sbatch.sh"

Test plan

  • uv run -- pytest test/run/test_experiment.py -k "export" -v — 4 new tests pass
  • uv run -- pytest test/run/test_experiment.py — all 66 tests pass
  • ruff check --fix . && ruff format . — clean

🤖 Generated with Claude Code

…tting

Adds `Experiment.export(output_dir)` which writes one script per job into
a self-contained directory plus a `submit_all.sh` launcher, enabling users
to inspect, version, and manually submit jobs without going through the
NeMo Run execution pipeline.

Each scheduler's `_submit_dryrun()` now writes its script to
`executor.experiment_dir` when set:
- LocalExecutor    → <task>.sh        (executable bash)
- DockerExecutor   → <task>.yaml
- SkypilotExecutor / SkypilotJobsExecutor → <task>.yaml
- LeptonExecutor   → <task>.sh        (executable bash)
- DGXCloudExecutor → <task>_torchrun_job.sh (was hardcoded, now uses job_name)

`Experiment.export()` redirects all executor experiment_dirs to output_dir,
runs dryrun to trigger script writing, then generates submit_all.sh with
the correct submit command per executor type (sbatch, sky launch, etc.).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: oliver könig <okoenig@nvidia.com>
ko3n1g and others added 2 commits March 16, 2026 10:35
…t tasks

Demonstrates Experiment.export() across all common executor types:
- export_local.py      — single LocalExecutor job → .sh + submit_all.sh
- export_multi_job.py  — three-job pipeline (preprocess/train/evaluate)
- export_script.py     — run.Script (inline bash) tasks
- export_slurm.py      — two SlurmExecutor jobs → *_sbatch.sh; no cluster needed
- export_dgxcloud.py   — DGXCloudExecutor job → *_torchrun_job.sh; no API calls needed

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: oliver könig <okoenig@nvidia.com>
…nd Script tasks"

This reverts commit 9a6c46c.

Signed-off-by: oliver könig <okoenig@nvidia.com>
Addresses CodeQL finding: overly permissive chmod 0o755 made generated
scripts world-readable/executable. Changed to 0o750 in local.py,
lepton.py, and the submit_all.sh writer in experiment.py.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: oliver könig <okoenig@nvidia.com>
_submit_dryrun() was unconditionally writing torchrun_job.sh, crashing
with AttributeError when job_name is unset (executor not yet assigned
to an experiment). Guard with `if executor.experiment_dir:`, consistent
with all other schedulers.
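The guard described in this commit can be sketched with minimal stand-ins. `Executor` and `submit_dryrun_write` below are hypothetical stubs, not NeMo Run's real classes; they only illustrate the "skip the write until the executor is attached to an experiment" pattern.

```python
from pathlib import Path


class Executor:
    """Minimal stand-in for a NeMo Run executor (hypothetical)."""

    def __init__(self, experiment_dir=None, job_name=None):
        self.experiment_dir = experiment_dir
        self.job_name = job_name


def submit_dryrun_write(executor, script_body):
    # Guard: experiment_dir (and job_name) are only set once the executor
    # is assigned to an experiment; writing before that would crash.
    if not executor.experiment_dir:
        return None
    path = Path(executor.experiment_dir) / f"{executor.job_name}_torchrun_job.sh"
    path.write_text(script_body)
    return path
```

An unattached executor falls through harmlessly instead of raising `AttributeError` on the missing name.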

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: oliver könig <okoenig@nvidia.com>
@ko3n1g ko3n1g marked this pull request as ready for review March 16, 2026 10:47
- Change chmod from 0o750 to 0o700 in local.py, lepton.py, and
  experiment.py to silence CodeQL "overly permissive" findings
- Fix _write_submit_script to handle JobGroup (uses job.executors,
  not the nonexistent job.jobs attribute)
- Add test_lepton.py (new file) covering create_scheduler,
  _submit_dryrun, file write, and no-write-without-experiment_dir
- Add test_submit_dryrun_writes_script/yaml to dgxcloud, docker,
  local, skypilot, and skypilot_jobs scheduler tests
- Add test_experiment_export_job_group covering the JobGroup branch
  in Experiment.export()
- Fix missing import os in test_dgxcloud.py
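The JobGroup fix above boils down to dispatching on the right attribute. A hedged sketch with hypothetical `Job`/`JobGroup` dataclasses (NeMo Run's actual classes differ) — the buggy code had reached for a nonexistent `job.jobs`:

```python
from dataclasses import dataclass, field


@dataclass
class Job:
    """Single job: exactly one executor (hypothetical stand-in)."""
    executor: str


@dataclass
class JobGroup:
    """Grouped jobs: a list of executors (hypothetical stand-in)."""
    executors: list = field(default_factory=list)


def executors_of(job):
    # A JobGroup exposes `.executors`; a plain Job has one `.executor`.
    # The fixed code must branch on this instead of assuming `.jobs`.
    if isinstance(job, JobGroup):
        return list(job.executors)
    return [job.executor]
```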

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: oliver könig <okoenig@nvidia.com>
Fixes code-scanning alerts 564 and 565 ("File is not always closed"):
- test_lepton.py: use `with open(script) as f` instead of bare open()
- test_local.py: same fix

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: oliver könig <okoenig@nvidia.com>