[train] Add sampler-as-reference KL option #1263

Draft
tyler-griggs wants to merge 1 commit into main from tgriggs/sampler-as-reference-kl

@tyler-griggs

Summary

  • Allow kl_reference_source: "rollout" to use rollout logprobs as KL reference
  • With this option, no reference model is held in memory and no reference forward pass is run
  • KL then constrains per-step drift from the sampling policy rather than cumulative drift from initialization
  • Validation checks updated to skip ref model allocation when not needed
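
For illustration, a minimal sketch of how the reference logprobs might be selected and fed into a KL term. Only the `kl_reference_source: "rollout"` value comes from this PR; the function names, the fallback to a frozen model for any other value, and the choice of the k3 estimator are assumptions, not the actual trainer code.

```python
import math
from typing import Optional, Sequence


def select_reference_logprobs(
    kl_reference_source: str,
    rollout_logprobs: Sequence[float],
    ref_model_logprobs: Optional[Sequence[float]] = None,
) -> Sequence[float]:
    """Pick the per-token reference logprobs for the KL term (hypothetical helper)."""
    if kl_reference_source == "rollout":
        # Reuse the logprobs recorded by the sampler; no reference model is needed.
        return rollout_logprobs
    # Assumption: any other value means a frozen reference model supplies them.
    if ref_model_logprobs is None:
        raise ValueError("reference model logprobs are required for this source")
    return ref_model_logprobs


def k3_kl_estimate(
    policy_logprobs: Sequence[float], reference_logprobs: Sequence[float]
) -> float:
    """Mean over tokens of exp(d) - d - 1, where d = reference - policy logprob."""
    terms = []
    for policy_lp, reference_lp in zip(policy_logprobs, reference_logprobs):
        d = reference_lp - policy_lp
        terms.append(math.exp(d) - d - 1.0)
    return sum(terms) / len(terms)
```

Each per-token term is non-negative and is zero when the policy and reference logprobs agree, so immediately after a weight sync the penalty against rollout logprobs would be close to zero.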

Test plan

  • Existing trainer tests pass
  • With kl_reference_source: "rollout", no ref model actors are created
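
The second check could be expressed roughly as below. This is a sketch under assumptions: `KLConfig`, `use_kl`, the `"ref_model"` default, and `needs_ref_model` are hypothetical names for illustration, and only the `"rollout"` value is taken from this PR.

```python
from dataclasses import dataclass


@dataclass
class KLConfig:
    # Hypothetical fields; the real trainer config may be structured differently.
    use_kl: bool = True
    kl_reference_source: str = "ref_model"  # assumed name for the frozen-model default


def needs_ref_model(cfg: KLConfig) -> bool:
    """Return True only when KL is enabled and the reference is not the rollout."""
    return cfg.use_kl and cfg.kl_reference_source != "rollout"
```

A test along these lines would assert that `needs_ref_model(KLConfig(kl_reference_source="rollout"))` is false before any ref model actors are allocated.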

🤖 Generated with Claude Code

Allow using rollout logprobs as the KL reference instead of a separate
frozen reference model, via kl_reference_source: "rollout". This
eliminates the reference model memory and forward pass compute.

With this option, KL constrains per-step policy drift (using the
sampling policy as reference) rather than cumulative drift from
initialization (using a frozen model).
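
To make the per-step versus cumulative distinction concrete, here is a toy sketch (not trainer code) with a two-action categorical policy whose first logit is nudged upward each step. The step size, the number of steps, and the KL direction (current policy against reference) are assumptions chosen for illustration.

```python
import math
from typing import List, Tuple


def softmax(logits: List[float]) -> List[float]:
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def categorical_kl(p: List[float], q: List[float]) -> float:
    """Exact KL(p || q) for two categorical distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def simulate_drift(
    num_steps: int = 10, step_size: float = 0.2
) -> Tuple[List[float], List[float]]:
    logits = [0.0, 0.0]
    initial = softmax(logits)  # plays the role of the frozen reference model
    per_step, cumulative = [], []
    for _ in range(num_steps):
        previous = softmax(logits)  # plays the role of the sampling policy
        logits[0] += step_size      # one small policy update
        current = softmax(logits)
        per_step.append(categorical_kl(current, previous))
        cumulative.append(categorical_kl(current, initial))
    return per_step, cumulative
```

In this toy setup every per-step KL stays small while the KL against the initial policy keeps growing, which is the kind of drift a rollout-based reference does not penalize.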

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>