
[train] Add KL-in-advantages mode#1262

Draft
tyler-griggs wants to merge 1 commit into main from
tgriggs/kl-in-advantages

Conversation

@tyler-griggs
Member

Summary

  • New KL penalty mode: batch-centered relative KL applied to advantages after group normalization
  • advantage += coef * (avg_batch_KL - token_KL)
  • Tokens drifting more than average get penalized; tokens drifting less get a bonus
  • Configure with use_kl_in_advantages: true, kl_advantages_coef: 0.01

Test plan

  • Existing trainer tests pass
  • KL advantage sums to ~0 (batch-centered)

🤖 Generated with Claude Code

Adds a new KL penalty mode that modifies advantages after group
normalization with a batch-centered relative KL signal:

    advantage += coef * (avg_batch_KL - token_KL)

Tokens drifting more than the batch average from the reference get
penalized; tokens drifting less get a bonus. The sum is approximately
zero (variance-reducing). Avoids the gradient bias of KL-in-loss.

Configure with use_kl_in_advantages: true, kl_advantages_coef: 0.01.
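The adjustment above can be sketched in a few lines. This is a hypothetical illustration, not the PR's actual implementation: the function name `apply_kl_in_advantages`, its signature, and the optional `mask` argument are assumptions; only the update rule `advantage += coef * (avg_batch_KL - token_KL)` comes from the PR description.

```python
import numpy as np

def apply_kl_in_advantages(advantages, token_kl, coef=0.01, mask=None):
    """Add a batch-centered relative KL signal to (group-normalized) advantages.

    Hypothetical sketch of the mode described in this PR. `advantages` and
    `token_kl` are per-token arrays; `mask` (assumed, not in the PR text)
    optionally excludes padding tokens from the batch average.
    """
    if mask is None:
        mask = np.ones_like(token_kl)
    # Average reference-policy KL over valid tokens in the batch.
    avg_kl = (token_kl * mask).sum() / mask.sum()
    # Tokens drifting more than average are penalized; tokens drifting
    # less get a bonus. With a full mask the added term sums to exactly
    # zero, so the batch-mean advantage is unchanged (variance-reducing).
    return advantages + coef * (avg_kl - token_kl) * mask
```

Because the correction is centered on the batch mean, its contribution sums to ~0 across the batch, matching the test-plan check, and it leaves the policy-gradient estimate unbiased in a way that adding KL directly to the loss does not.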

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
