feat: add qos, mem, and save_strategy config fields #23
Add optional qos and mem fields to SlurmConfig so jobs can specify a SLURM QoS class and memory limit without hardcoding them in the template. Add save_strategy to CheckpointingConfig (default "steps") to allow epoch or disabled checkpointing. Wire all three through the SLURM templates and TrainingArguments.
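As a rough illustration of the config side of this change, the two dataclasses might look like the sketch below. Only the names SlurmConfig, CheckpointingConfig, qos, mem, save_strategy, and save_steps come from the PR; the save_steps default, the validation, and the checkpoint_kwargs helper are assumptions made for the example. The set of allowed values follows Hugging Face TrainingArguments, where "no" disables checkpointing; whether this repo uses the same spelling is also an assumption.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional

# Values accepted by Hugging Face TrainingArguments.save_strategy.
# Assumption: the repo passes the string through unchanged.
ALLOWED_SAVE_STRATEGIES = {"steps", "epoch", "no"}


@dataclass
class SlurmConfig:
    # Both fields are optional; None means "emit no directive".
    qos: Optional[str] = None   # e.g. "boost_qos_dbg"
    mem: Optional[str] = None   # e.g. "64G"


@dataclass
class CheckpointingConfig:
    save_strategy: str = "steps"  # default named in the PR description
    save_steps: int = 500         # hypothetical default, for illustration only

    def __post_init__(self) -> None:
        # Hypothetical validation: fail early on a misspelled strategy.
        if self.save_strategy not in ALLOWED_SAVE_STRATEGIES:
            raise ValueError(f"unknown save_strategy: {self.save_strategy!r}")


def checkpoint_kwargs(cfg: CheckpointingConfig) -> Dict[str, Any]:
    """Build the keyword arguments that would be forwarded to TrainingArguments."""
    kwargs: Dict[str, Any] = {"save_strategy": cfg.save_strategy}
    # save_steps is only meaningful for step-based checkpointing.
    if cfg.save_strategy == "steps":
        kwargs["save_steps"] = cfg.save_steps
    return kwargs
```

With this shape, an existing config that never mentions save_strategy keeps its step-based behaviour, and leaving qos or mem unset changes nothing in the generated job.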
KonstiNik
left a comment
Three clean additions, all correctly wired in the TRL paths. One coverage gap before merge: the LlamaFactory render path (launcher.py:154-173) doesn't forward qos/mem, and job_llamafactory.sh.jinja is missing the matching {% if qos %} / {% if mem %} blocks. Since these are SLURM scheduler directives — the same #SBATCH header the TRL templates use — a LlamaFactory user setting slurm.qos or slurm.mem would see them silently ignored. Worth mirroring the change there for consistency.
Done. Wrote a few tests that check the generated job script, too, and added pytest as a dev dependency in pyproject.toml.
KonstiNik
left a comment
Thanks for covering the LlamaFactory path, and the tests are a great addition. I'm all for having more tests in the repo.
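For readers following the thread, the behaviour under discussion (optional scheduler directives that appear only when set) and the kind of job-script check mentioned above can be sketched as follows. This uses plain Python in place of the repo's Jinja templates; render_sbatch_header and both test functions are hypothetical names, not the PR's actual code. The --qos and --mem options themselves are standard sbatch flags.

```python
from typing import Optional


def render_sbatch_header(
    job_name: str,
    qos: Optional[str] = None,
    mem: Optional[str] = None,
) -> str:
    """Hypothetical stand-in for the template's #SBATCH header."""
    lines = ["#!/bin/bash", f"#SBATCH --job-name={job_name}"]
    # Mirrors the template's `{% if qos %}` / `{% if mem %}` guards:
    # a directive is emitted only when the field is set.
    if qos:
        lines.append(f"#SBATCH --qos={qos}")
    if mem:
        lines.append(f"#SBATCH --mem={mem}")
    return "\n".join(lines) + "\n"


def test_qos_and_mem_are_emitted() -> None:
    header = render_sbatch_header("demo", qos="boost_qos_dbg", mem="64G")
    assert "#SBATCH --qos=boost_qos_dbg" in header
    assert "#SBATCH --mem=64G" in header


def test_unset_fields_are_omitted() -> None:
    header = render_sbatch_header("demo")
    assert "--qos" not in header
    assert "--mem" not in header
```

Running the same two checks against every render path (TRL and LlamaFactory) is what would catch a path that silently drops the settings.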
Summary
- Add qos and mem fields to SlurmConfig so jobs can target a specific SLURM QoS class (e.g. boost_qos_dbg) and cap memory allocation without hardcoding them in the template
- Add save_strategy: str = "steps" to CheckpointingConfig to allow epoch-based or disabled checkpointing alongside the existing save_steps
- Wire all three through the SLURM templates and TrainingArguments
Type of change