fix: correct environment setup in container SLURM template#22
…R in container template

CUDA_DEVICE_MAX_CONNECTIONS=1 is a Megatron-LM tensor-parallel flag that serializes CUDA streams and hurts ZeRO-2 overlap_comm; removed.

PYTHONPATH was cleared to an empty string, causing the container to use its baked-in post_training package instead of the local src/ checkout. Now set to repo_dir/src so local changes take effect.

WANDB_DIR is set to the run directory so that offline WandB runs persist to Lustre-backed shared storage rather than being lost on node teardown.
KonstiNik
left a comment
Three bugs, all good fixes. Two notes:
- The `PYTHONPATH=""` issue also exists in job_llamafactory.sh.jinja:73. Maybe worth considering here too.
- `WANDB_DIR={{ repo_dir }}/{{ run_dir }}` only works because `run_dir` is relative by default. It's constructed as `Path(output_base) / run_name` at paths.py:57-59, and `output_base` is just a string (config.py:207), but nothing prevents a user from setting it to an absolute path like `/scratch/foo`. If they do, wandb fails to write (or writes somewhere unexpected, since that path likely isn't mounted in the container).

  Potential fix: compute an absolute `run_dir` at the launcher boundary and pass that to the template, e.g. at launcher.py:117, pass `run_dir=str(Path.cwd() / run_dir if not run_dir.is_absolute() else run_dir)`, then template `WANDB_DIR={{ run_dir }}` directly.
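
To make the failure mode concrete, here is a minimal sketch of what the original template line produces for a relative versus an absolute `run_dir`, next to the normalization suggested above. The function names and example paths are hypothetical and not taken from the repository. The base directory is passed in explicitly so the example is deterministic, whereas the real launcher would presumably use `Path.cwd()` or `run_dir.resolve()`.

```python
from pathlib import PurePosixPath


def naive_wandb_dir(repo_dir: str, run_dir: str) -> str:
    """Mimic the original template line WANDB_DIR={{ repo_dir }}/{{ run_dir }}.

    Jinja substitution is plain string concatenation, so an absolute
    run_dir ends up glued behind repo_dir instead of replacing it.
    """
    return f"{repo_dir}/{run_dir}"


def absolute_run_dir(run_dir: str, base: str) -> str:
    """Return run_dir unchanged if it is absolute, else anchor it under base.

    'base' stands in for the launcher's working directory (an assumption
    made here for testability).
    """
    path = PurePosixPath(run_dir)
    if not path.is_absolute():
        path = PurePosixPath(base) / path
    return str(path)


if __name__ == "__main__":
    repo = "/home/user/repo"  # hypothetical checkout location
    for candidate in ("outputs/run1", "/scratch/foo/run1"):
        print("naive:   ", naive_wandb_dir(repo, candidate))
        print("absolute:", absolute_run_dir(candidate, repo))
```

With a relative `run_dir` both variants agree. With an absolute one, the naive concatenation points at a directory under the repo rather than the intended location, while the normalized value can be templated as `WANDB_DIR={{ run_dir }}` without a prefix.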
WANDB_DIR was constructed as repo_dir/run_dir in job_trl_container.sh.jinja, which breaks when output_base is an absolute path. Pass run_dir.resolve() from the launcher so templates always receive an absolute path, then simplify WANDB_DIR to use run_dir directly. Same resolve() applied to the LlamaFactory renderer for consistency.
Fixed.
Summary
Three correctness fixes to `job_trl_container.sh.jinja`:
- `CUDA_DEVICE_MAX_CONNECTIONS=1`: this is a Megatron-LM tensor-parallel flag that serializes CUDA streams and hurts ZeRO-2's `overlap_comm`; removed.
- `PYTHONPATH`: was set to `""`, causing the container to use its baked-in `post_training` package instead of the local `src/` checkout. Now set to `{{ repo_dir }}/src`.
- `WANDB_DIR`: set to the run directory so offline WandB runs persist to Lustre-backed shared storage rather than being lost when the compute node is released.

Type of change