Add configurable data_seed independent of training seed#112

Merged: amazloumi merged 1 commit into main from feat/data-seed-config on May 27, 2026

@amazloumi amazloumi commented May 26, 2026

Summary

  • Add TrainConfig.data_seed (int | None, default None) so the data-shuffling seed can be set independently of the parameter-init seed.
  • Add TrainConfig.effective_data_seed property: it returns data_seed when set and otherwise falls back to seed, so existing configs reproduce their current trajectory byte-for-byte. (I am willing to remove this backward-compatibility fallback.)
  • Route the five training-data sampler/dataset sites in scripts/train.py (VLM, mixture, mmap, HF-streaming, HF-eager) through effective_data_seed. Eval samplers keep using seed.
  • The data seed is passed identically to all data-parallel ranks. This behavior is unchanged: every rank builds the same global shuffle and then takes its own partition, so unlike the init seed in _set_seed, the data seed is deliberately not rank-offset.
  • Add unit tests for the fallback and override semantics.
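
The fallback and override semantics listed above can be sketched roughly as follows. This is a minimal illustration, not the code in kempnerforge/config/training.py: the class here is a hypothetical stand-in that models only the two seed fields, and the default value of seed is an assumption.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class TrainConfig:
    """Hypothetical stand-in for the real TrainConfig; only seed fields are modeled."""

    seed: int = 42  # assumed default; the real default is not stated in this PR
    data_seed: int | None = None

    @property
    def effective_data_seed(self) -> int:
        # Compare against None rather than truthiness so data_seed=0 is honored.
        return self.seed if self.data_seed is None else self.data_seed


legacy = TrainConfig(seed=7)                 # existing config: no data_seed
split = TrainConfig(seed=7, data_seed=123)   # data order varied independently
print(legacy.effective_data_seed)  # 7
print(split.effective_data_seed)   # 123
```

Checking `is None` instead of `data_seed or seed` matters in this sketch because 0 is a valid seed and would otherwise silently fall back to seed.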

Enables independently varying parameter initialization and data order, while keeping single-seed runs identical.
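
To illustrate why the data seed is shared across data-parallel ranks rather than rank-offset, here is a small sketch using only the standard library. The function rank_indices is hypothetical and is not part of scripts/train.py; it assumes the common scheme in which each rank builds the same global permutation and keeps a strided slice of it (the pattern torch.utils.data.DistributedSampler follows).

```python
import random


def rank_indices(num_samples: int, data_seed: int, rank: int, world_size: int) -> list[int]:
    """Hypothetical helper: per-rank sample indices from one shared global shuffle."""
    # Every rank builds the same permutation because the seed is identical.
    order = list(range(num_samples))
    random.Random(data_seed).shuffle(order)
    # Each rank keeps a strided slice, so the partitions cannot overlap.
    return order[rank::world_size]


shards = [rank_indices(12, data_seed=123, rank=r, world_size=4) for r in range(4)]

# With a shared seed the shards are disjoint and cover the dataset exactly once.
assert sorted(i for shard in shards for i in shard) == list(range(12))
```

If each rank instead used a different seed, the ranks would slice different permutations, and nothing would guarantee that a sample is seen exactly once per epoch.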

Testing

  • uv run ruff check kempnerforge/ tests/ passes
  • uv run ruff format --check kempnerforge/ tests/ scripts/ passes
  • uv run pyright kempnerforge/ passes (0 errors)
  • uv run pytest tests/unit/ -v --timeout=60 passes
  • If distributed code changed: uv run torchrun --nproc_per_node=4 -m pytest tests/distributed/ -v
  • If training loop / parallelism / optimizers changed: uv run pytest tests/e2e/ --e2e -v

Closes #110

codecov Bot commented May 26, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.

| Files with missing lines | Coverage Δ |
|---|---|
| kempnerforge/config/training.py | 87.23% <100.00%> (+1.18%) ⬆️ |

@amazloumi amazloumi requested review from Naeemkh and mmshad May 26, 2026 23:58
@Naeemkh Naeemkh left a comment

  • Looks good to me. I am curious whether the seed can be changed mid-training. For example, suppose you train with data_seed=42 to step 5000, save a checkpoint, then edit the config to data_seed=32 and resume. What happens? What should a user expect?
  • Are you planning on changing data trajectory mid-run?

Also, I think we need to keep the backward compatibility. The current fallback seems right.

@amazloumi
Member Author

> Are you planning on changing data trajectory mid-run?

No. I am curious what the use cases would be in that scenario.

@amazloumi
Member Author

> Also, I think we need to keep the backward compatibility. The current fallback seems right.

Since there are not many active experiments yet, we may want to change all the configs to have both seed and data_seed (even with the same value); then we would not need this fallback. But I am still fine with keeping the fallback.

Naeemkh commented May 27, 2026

> > Also, I think we need to keep the backward compatibility. The current fallback seems right.
>
> Since there are not many active experiments yet, we may want to change all the configs to have both seed and data_seed (even with the same value); then we would not need this fallback. But I am still fine with keeping the fallback.

Let's do that in a separate PR.

@amazloumi amazloumi merged commit e3cfe12 into main May 27, 2026
6 checks passed

Linked issue: Make data-shuffling seed independently configurable from training seed

2 participants