fix: prevent run directory from being wiped on job requeue #19

Merged
Neonkraft merged 2 commits into main from fix/run-dir-wipe-race
Apr 30, 2026

@Neonkraft
Collaborator

Summary

Gate the destructive run-directory wipe behind an allow_override parameter so train.py on compute nodes can never accidentally delete an active run directory.

Previously, setup_run_directory() called shutil.rmtree on the run dir whenever debug.override_existing=True, regardless of caller. On a requeue or resume, train.py calls setup_run_directory() again, which wiped the checkpoints mid-job. The fix adds an allow_override: bool = False parameter; only submit.py (the submission-time entrypoint) passes True.
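
A minimal sketch of the gating logic, for illustration only. The real signature of setup_run_directory() is not shown in this PR, so the run_dir and override_existing parameters here are assumptions; override_existing stands in for the debug.override_existing config value, and only allow_override comes from the description above.

```python
import shutil
from pathlib import Path


def setup_run_directory(run_dir, override_existing=False, allow_override=False):
    """Create run_dir, wiping it first only when both flags are set.

    override_existing: hypothetical stand-in for debug.override_existing.
    allow_override: explicit caller opt-in; defaults to False so that
    callers such as train.py can never trigger the wipe.
    """
    run_dir = Path(run_dir)
    if run_dir.exists() and override_existing and allow_override:
        # Destructive step: reached only when the caller opts in.
        shutil.rmtree(run_dir)
    run_dir.mkdir(parents=True, exist_ok=True)
    return run_dir
```

Under this sketch, submit.py would call setup_run_directory(run_dir, override_existing, allow_override=True), while train.py would keep the default, so a requeued job finds its checkpoints intact even when debug.override_existing=True.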

Type of change

  • Bug fix
  • New feature
  • Refactor
  • Performance
  • Documentation
  • Maintenance

setup_run_directory() would delete the active run directory whenever
debug.override_existing=True, including when called from train.py on
compute nodes during a running job. Add allow_override=False default so
the rmtree only fires when submit.py explicitly opts in.
@Neonkraft Neonkraft requested a review from KonstiNik April 29, 2026 13:50
Collaborator

@KonstiNik KonstiNik left a comment

LGTM. Adds a small logic layer that confines the wipe to submission time.

@Neonkraft Neonkraft merged commit 81ba56a into main Apr 30, 2026
2 checks passed