Add configurable data_seed independent of training seed #112
Conversation
Codecov Report: ✅ All modified and coverable lines are covered by tests.
Naeemkh
left a comment
- Looks good to me. I am curious whether you can change the seed mid-training. For example, if you train with `data_seed=42` to step 5000, save checkpoints, then edit the config to `data_seed=32` and resume, what happens? What should a user expect?
- Are you planning on changing the data trajectory mid-run?
Also, I think we need to keep the backward compatibility. The current fallback seems right.
No; I am curious what the use cases would be in that scenario.
Since there are not many active experiments yet, we may want to change all the configs to have both `seed` and `data_seed` (even with the same value); then we would not need this fallback. I am still fine with keeping the fallback, though.
Let's do that in a separate PR.
Summary
- `TrainConfig.data_seed` (`int | None`, default `None`) so the data-shuffling seed can be set independently of the parameter-init `seed`.
- `TrainConfig.effective_data_seed` property: returns `data_seed` when set, else falls back to `seed`, so existing configs reproduce their current trajectory byte-for-byte (willing to remove the backward compatibility).
- `scripts/train.py` (VLM, mixture, mmap, HF-streaming, HF-eager) through `effective_data_seed`. Eval samplers keep using `seed`.
- `_set_seed`, it is deliberately not rank-offset).

Enables independently varying parameter initialization and data order, while keeping single-seed runs identical.
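The fallback behavior described in this summary can be illustrated with a minimal sketch. This is not the code from the PR: the `TrainConfig` below is a stand-in that models only the two seed fields, the default `seed=0` is an assumption, and `shuffled_indices` is a hypothetical helper that stands in for a data sampler's shuffle.

```python
from __future__ import annotations

import random
from dataclasses import dataclass


@dataclass
class TrainConfig:
    """Illustrative stand-in; only the two seed fields are modeled."""

    seed: int = 0  # parameter-init seed (default value is an assumption)
    data_seed: int | None = None  # data-shuffling seed; None means "follow seed"

    @property
    def effective_data_seed(self) -> int:
        # Use data_seed when it is set, otherwise fall back to seed.
        # Compare against None explicitly so that data_seed=0 is honored.
        return self.seed if self.data_seed is None else self.data_seed


def shuffled_indices(n: int, seed: int) -> list[int]:
    # Hypothetical helper: a seeded shuffle standing in for a data sampler.
    rng = random.Random(seed)
    indices = list(range(n))
    rng.shuffle(indices)
    return indices


# A legacy config with only `seed` keeps its data order after the change.
legacy = TrainConfig(seed=42)
# A new config can vary parameter init while pinning the data order.
varied_init = TrainConfig(seed=7, data_seed=42)

same_order = (
    shuffled_indices(10, legacy.effective_data_seed)
    == shuffled_indices(10, varied_init.effective_data_seed)
)
print(same_order)
```

Both configs resolve to the same effective data seed, so the sketch's shuffle yields the same order for both even though their `seed` values differ.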
Testing
- `uv run ruff check kempnerforge/ tests/` passes
- `uv run ruff format --check kempnerforge/ tests/ scripts/` passes
- `uv run pyright kempnerforge/` passes (0 errors)
- `uv run pytest tests/unit/ -v --timeout=60` passes
- `uv run torchrun --nproc_per_node=4 -m pytest tests/distributed/ -v`
- `uv run pytest tests/e2e/ --e2e -v`

Closes #110