Skip to content

feat: add qos, mem, and save_strategy config fields#23

Merged
Neonkraft merged 5 commits intomainfrom
feat/slurm-qos-mem
Apr 30, 2026
Merged

feat: add qos, mem, and save_strategy config fields#23
Neonkraft merged 5 commits intomainfrom
feat/slurm-qos-mem

Conversation

@Neonkraft
Copy link
Copy Markdown
Collaborator

Summary

  • Add optional qos and mem fields to SlurmConfig so jobs can target a specific SLURM QoS class (e.g. boost_qos_dbg) and cap memory allocation without hardcoding them in the template
  • Add save_strategy: str = "steps" to CheckpointingConfig to allow epoch-based or disabled checkpointing alongside the existing save_steps
  • Wire all three through both SLURM templates and TrainingArguments

Type of change

  • Bug fix
  • New feature
  • Refactor
  • Performance
  • Documentation
  • Maintenance

Add optional qos and mem fields to SlurmConfig so jobs can specify a
SLURM QoS class and memory limit without hardcoding them in the template.
Add save_strategy to CheckpointingConfig (default "steps") to allow
epoch or disabled checkpointing. Wire all three through the SLURM
templates and TrainingArguments.
@Neonkraft Neonkraft requested a review from KonstiNik April 29, 2026 14:17
Copy link
Copy Markdown
Collaborator

@KonstiNik KonstiNik left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Three clean additions, all correctly wired in the TRL paths. One coverage gap before merge: the LlamaFactory render path (launcher.py:154-173) doesn't forward qos/mem, and job_llamafactory.sh.jinja is missing the matching {% if qos %} / {% if mem %} blocks. Since these are SLURM scheduler directives — the same #SBATCH header the TRL templates use — a LlamaFactory user setting slurm.qos or slurm.mem would see them silently ignored. Worth mirroring the change there for consistency.

@Neonkraft
Copy link
Copy Markdown
Collaborator Author

Done. Wrote a few tests that check the generated job script, too, and added pytest as a dev dependency in pyproject.toml.

Copy link
Copy Markdown
Collaborator

@KonstiNik KonstiNik left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for covering the LlamaFactory, and having the test is an amazing addition. I'm all for having more tests in the repo.

@Neonkraft Neonkraft merged commit 1477fed into main Apr 30, 2026
2 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants