Add configurable data_seed independent of training seed by amazloumi · Pull Request #112 · KempnerInstitute/KempnerForge

amazloumi · 2026-05-26T23:54:16Z

Summary

Add TrainConfig.data_seed (int | None, default None) so the data-shuffling seed can be set independently of the parameter-init seed.
Add TrainConfig.effective_data_seed property: returns data_seed when set, else falls back to seed, so existing configs reproduce their current trajectory byte-for-byte (I am willing to remove this backward-compatible fallback if preferred).
Route the five training-data sampler/dataset sites in scripts/train.py (VLM, mixture, mmap, HF-streaming, HF-eager) through effective_data_seed. Eval samplers keep using seed.
The data seed is passed identically to all data-parallel ranks. This is unchanged behavior: a consistent global shuffle is partitioned per rank, so unlike the init seed in _set_seed, the data seed is deliberately not rank-offset.
Add unit tests for the fallback and override semantics.
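
A minimal sketch of the fallback and override semantics listed above, for readers who want to see them concretely. It assumes TrainConfig can be modeled as a plain dataclass; the real class in kempnerforge/config/training.py has more fields and may be defined differently, and the default seed value of 42 is an assumption made only for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class TrainConfig:
    # Hypothetical, trimmed-down stand-in for the real config class.
    seed: int = 42  # parameter-init seed (default value assumed here)
    data_seed: int | None = None  # data-shuffling seed; None means "use seed"

    @property
    def effective_data_seed(self) -> int:
        # Compare against None explicitly so that data_seed=0 is honored.
        return self.seed if self.data_seed is None else self.data_seed
```

With this shape, a config that sets only seed keeps its existing data order, while setting data_seed changes the data order without touching parameter initialization.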

This lets parameter initialization and data order be varied independently, while keeping single-seed runs identical.
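
The note about passing the same data seed to every data-parallel rank can be illustrated with a small standard-library sketch. This is not the repository's sampler code: rank_indices is a hypothetical helper, and the strided slice is just one common way to partition a shared permutation across ranks.

```python
import random


def rank_indices(n: int, data_seed: int, rank: int, world_size: int) -> list[int]:
    # Every rank builds the same global permutation from the shared seed
    # (no rank offset is added to data_seed).
    order = list(range(n))
    random.Random(data_seed).shuffle(order)
    # Each rank then takes its own strided slice of that permutation.
    return order[rank::world_size]
```

Because every rank derives the same permutation, the per-rank slices are disjoint and together cover the dataset. If the seed were offset per rank, each rank would slice a different permutation and samples could be duplicated or skipped.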

Testing

uv run ruff check kempnerforge/ tests/ passes
uv run ruff format --check kempnerforge/ tests/ scripts/ passes
uv run pyright kempnerforge/ passes (0 errors)
uv run pytest tests/unit/ -v --timeout=60 passes
If distributed code changed: uv run torchrun --nproc_per_node=4 -m pytest tests/distributed/ -v
If training loop / parallelism / optimizers changed: uv run pytest tests/e2e/ --e2e -v

Closes #110

codecov · 2026-05-26T23:57:36Z

Codecov Report

✅ All modified and coverable lines are covered by tests.

Files with missing lines	Coverage Δ
kempnerforge/config/training.py	`87.23% <100.00%> (+1.18%)`	⬆️

Naeemkh

Looks good to me. I am curious whether you can change the seed mid-training. For example, suppose you train with data_seed=42 to step 5000, save checkpoints, then edit the config to data_seed=32 and resume. What happens? What should a user expect?
Are you planning on changing data trajectory mid-run?

Also, I think we need to keep the backward compatibility. The current fallback seems right.

amazloumi · 2026-05-27T13:47:44Z

> Are you planning on changing data trajectory mid-run?

No. I am curious what the use cases would be in that scenario.

amazloumi · 2026-05-27T13:50:46Z

> Also, I think we need to keep the backward compatibility. The current fallback seems right.

Since there are not many active experiments yet, we may want to change all the configs to have both seed and data_seed (even with the same value); then we would not need this fallback. I am still fine with keeping the fallback, though.

Naeemkh · 2026-05-27T13:58:45Z

> Also, I think we need to keep the backward compatibility. The current fallback seems right.

> Since there are not many active experiments yet, we may want to change all the configs to have both seed and data_seed (even with the same value); then we would not need this fallback. I am still fine with keeping the fallback, though.

Let's do that in a separate PR.

Add configurable data_seed independent of training seed

2600a0e

amazloumi requested review from Naeemkh and mmshad May 26, 2026 23:58

Naeemkh reviewed May 27, 2026

Naeemkh approved these changes May 27, 2026

amazloumi merged commit e3cfe12 into main May 27, 2026
6 checks passed

amazloumi merged 1 commit into main from feat/data-seed-config