
[train] Add DRO (Direct Reward Optimization) policy loss#1259

Draft
tyler-griggs wants to merge 2 commits into main from tgriggs/dro-loss

Conversation

@tyler-griggs
Member

Summary

  • New policy loss: a smooth trust region formed by an exponential tilt over the PPO clipped surrogate
  • L_dro = (1/beta) * log(E[exp(beta * L_ppo)])
  • Configure with policy_loss_type: "dro" and dro.beta: 0.1
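The tilted loss above can be sketched numerically. This is a minimal NumPy sketch, not the PR's actual implementation: the function name `dro_loss` and its signature are illustrative, and the log-sum-exp shift is an assumed stabilization detail.

```python
import numpy as np

def dro_loss(ppo_loss, mask, beta=0.1):
    """Exponential tilt over per-token PPO losses (illustrative sketch).

    L_dro = (1/beta) * log( E[ exp(beta * L_ppo) ] )

    Computed via a shifted log-sum-exp for numerical stability.
    As beta -> 0 this recovers the masked mean; larger beta
    up-weights the worst-case (highest-loss) tokens.
    """
    ppo_loss = np.asarray(ppo_loss, dtype=np.float64)
    mask = np.asarray(mask, dtype=bool)
    x = beta * ppo_loss[mask]       # tilt the masked per-token losses
    m = x.max()                     # shift by the max before exponentiating
    lse = m + np.log(np.exp(x - m).mean())
    return lse / beta
```

By Jensen's inequality the tilted value is always at least the plain mean, which is what makes it a "robust" (worst-case-focused) objective.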

Test plan

  • Existing loss tests pass
  • Manual smoke test with DRO config

🤖 Generated with Claude Code

Ubuntu and others added 2 commits March 3, 2026 18:06
Adds a new policy loss function that wraps the PPO clipped surrogate
in an exponential tilt, focusing optimization on worst-case tokens.

    L_dro = (1/beta) * log(E[exp(beta * L_ppo)])

Configurable via policy_loss_type: "dro" with dro.beta controlling
the degree of robustness (default 0.1).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add "sum" to reduce_loss (simple masked sum, no averaging)
- Update ppo_policy_loss assertion, config validation, and docstring
  to include "sum" as a valid loss_reduction option
- Recreate test file covering only functions that exist on this branch:
  DRO loss (5 tests), sum reduction (4 tests), zero-variance filter
  (4 tests), config fields (1 test)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
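The "sum" reduction described in this commit can be sketched as follows. This is a hypothetical standalone version for illustration; the name `masked_sum_loss` and signature are assumptions, not code from this branch.

```python
import numpy as np

def masked_sum_loss(per_token_loss, mask):
    """'sum' reduction: masked sum of per-token losses, no averaging.

    Unlike mean-style reductions, the result scales with the number of
    unmasked tokens, so longer sequences contribute proportionally more
    to the gradient.
    """
    per_token_loss = np.asarray(per_token_loss, dtype=np.float64)
    mask = np.asarray(mask, dtype=np.float64)
    return float((per_token_loss * mask).sum())
```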
