[train] Add per-token hard masking for off-policy correction by tyler-griggs · Pull Request #1264 · NovaSky-AI/SkyRL

tyler-griggs · 2026-03-03T18:08:14Z

Summary

Zeros individual divergent tokens whose train/infer importance-sampling (IS) ratio falls outside configurable bounds.
Unlike outlier masking, which rejects entire sequences, this masks only the specific divergent tokens.
Configure with off_policy_correction.token_mask_eps_low and off_policy_correction.token_mask_eps_high.
For full IS-corrected masking, combine with tis_ratio_type: "token".
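
The masking rule in the summary can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the code in this PR: the function name `token_hard_mask` is hypothetical, the ratio is assumed to be computed as exp(train_logprob - infer_logprob), and the bounds are assumed to be inclusive. Only the bound arithmetic (lower = 1 - eps_low, upper = 1 + eps_high) comes from the PR text.

```python
import math


def token_hard_mask(train_logprobs, infer_logprobs, eps_low=0.2, eps_high=0.28):
    """Return a 0/1 keep-mask with one entry per token (hypothetical helper).

    A token is kept when its train/infer ratio lies in
    [1 - eps_low, 1 + eps_high]; otherwise it is zeroed.
    Inclusive bounds are an assumption of this sketch.
    """
    low, high = 1.0 - eps_low, 1.0 + eps_high
    mask = []
    for lp_train, lp_infer in zip(train_logprobs, infer_logprobs):
        # Importance ratio between the training and inference policies.
        ratio = math.exp(lp_train - lp_infer)
        mask.append(1.0 if low <= ratio <= high else 0.0)
    return mask


# One sequence of three tokens with ratios of roughly 1.0, 2.0 and 0.56.
train = [math.log(0.5)] * 3
infer = [math.log(0.5), math.log(0.25), math.log(0.9)]
print(token_hard_mask(train, infer))  # [1.0, 0.0, 0.0]
```

Only the second and third tokens are dropped; the first token still contributes, whereas sequence-level rejection would discard all three.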

Test plan

Existing off-policy correction tests pass
In-bounds tokens kept, divergent tokens zeroed

🤖 Generated with Claude Code

Zeros individual tokens where the train/infer importance ratio falls outside configurable bounds, while keeping the rest of the sequence. Unlike outlier_token_mask (which rejects entire sequences), this surgically removes only the divergent tokens.

Configure with:

    off_policy_correction.token_mask_eps_low: 0.2    # lower bound = 0.8
    off_policy_correction.token_mask_eps_high: 0.28  # upper bound = 1.28

For full IS-corrected masking, combine with tis_ratio_type: "token".

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
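
To illustrate how the hard mask might combine with token-level IS correction, here is a hedged sketch. It assumes that tis_ratio_type: "token" means each token's loss is weighted by its own truncated importance ratio; the function name `masked_token_weights` and the `tis_cap` parameter are hypothetical and are not taken from the SkyRL configuration.

```python
import math


def masked_token_weights(train_logprobs, infer_logprobs,
                         eps_low=0.2, eps_high=0.28, tis_cap=2.0):
    """Per-token loss weights: truncated IS ratio times the hard mask.

    Hypothetical sketch. `tis_cap` stands in for whatever truncation
    threshold the token-level IS correction uses.
    """
    low, high = 1.0 - eps_low, 1.0 + eps_high
    weights = []
    for lp_train, lp_infer in zip(train_logprobs, infer_logprobs):
        ratio = math.exp(lp_train - lp_infer)
        keep = low <= ratio <= high  # hard mask: in-bounds tokens only
        # Kept tokens carry their (capped) ratio; divergent tokens get 0.
        weights.append(min(ratio, tis_cap) if keep else 0.0)
    return weights


# Ratios of roughly 1.0, 1.2 and 2.0: the last one exceeds the 1.28 bound.
train = [math.log(0.5), math.log(0.6), math.log(0.5)]
infer = [math.log(0.5), math.log(0.5), math.log(0.25)]
weights = masked_token_weights(train, infer)
```

With the example bounds, every kept token has a weight between 0.8 and 1.28, so a cap above 1.28 never binds on kept tokens in this sketch; the mask alone limits how far a token's weight can drift from 1.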

vercel bot deployed to Preview on March 3, 2026 18:28

tyler-griggs wants to merge 1 commit into main from tgriggs/per-token-masking
