
[train] Add DRO (Direct Reward Optimization) policy loss#1259

Draft
tyler-griggs wants to merge 2 commits into main from tgriggs/dro-loss

Conversation

@tyler-griggs
Member

Summary

  • New policy loss: a smooth trust region formed by an exponential tilt over the PPO clipped surrogate
  • L_dro = (1/beta) * log(E[exp(beta * L_ppo)])
  • Configure with policy_loss_type: "dro" and dro.beta: 0.1
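The tilted loss above can be sketched numerically. This is a minimal NumPy sketch, not the PR's actual implementation: the function name `dro_loss` and its signature are illustrative, and the log-sum-exp shift is an assumed stabilization detail.

```python
import numpy as np

def dro_loss(ppo_loss, mask, beta=0.1):
    """Exponential tilt over per-token PPO losses (illustrative sketch).

    L_dro = (1/beta) * log( E[ exp(beta * L_ppo) ] )

    Computed via a shifted log-sum-exp for numerical stability.
    As beta -> 0 this recovers the masked mean; larger beta
    up-weights the worst-case (highest-loss) tokens.
    """
    ppo_loss = np.asarray(ppo_loss, dtype=np.float64)
    mask = np.asarray(mask, dtype=bool)
    x = beta * ppo_loss[mask]       # tilt the masked per-token losses
    m = x.max()                     # shift by the max before exponentiating
    lse = m + np.log(np.exp(x - m).mean())
    return lse / beta
```

By Jensen's inequality the tilted value is always at least the plain mean, which is what makes it a "robust" (worst-case-focused) objective.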

Test plan

  • Existing loss tests pass
  • Manual smoke test with DRO config

🤖 Generated with Claude Code

Ubuntu and others added 2 commits March 3, 2026 18:06
Adds a new policy loss function that wraps the PPO clipped surrogate
in an exponential tilt, focusing optimization on worst-case tokens.

    L_dro = (1/beta) * log(E[exp(beta * L_ppo)])

Configurable via policy_loss_type: "dro" with dro.beta controlling
the degree of robustness (default 0.1).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add "sum" to reduce_loss (simple masked sum, no averaging)
- Update ppo_policy_loss assertion, config validation, and docstring
  to include "sum" as a valid loss_reduction option
- Recreate test file covering only functions that exist on this branch:
  DRO loss (5 tests), sum reduction (4 tests), zero-variance filter
  (4 tests), config fields (1 test)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
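The "sum" reduction described in this commit can be sketched as follows. This is a hypothetical standalone version for illustration; the name `masked_sum_loss` and signature are assumptions, not code from this branch.

```python
import numpy as np

def masked_sum_loss(per_token_loss, mask):
    """'sum' reduction: masked sum of per-token losses, no averaging.

    Unlike mean-style reductions, the result scales with the number of
    unmasked tokens, so longer sequences contribute proportionally more
    to the gradient.
    """
    per_token_loss = np.asarray(per_token_loss, dtype=np.float64)
    mask = np.asarray(mask, dtype=np.float64)
    return float((per_token_loss * mask).sum())
```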
