We currently support a few loss reduction methods:
- `sequence_mean`: GRPO
- `token_mean`: DAPO
- `seq_mean_token_sum_norm`: Dr. GRPO
This paper proposes a different loss reduction ("VL Norm") that "provides an unbiased estimate of the true policy loss but also minimizes gradient variance".
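For context, a minimal sketch of how the three existing reductions differ, assuming per-token losses grouped by sequence (the function and variable names here are illustrative, not the actual implementation):

```python
def reduce_loss(token_losses, mode, max_len=None):
    """Reduce per-token losses to a scalar.

    token_losses: list of per-sequence lists of per-token loss values
    mode: one of "sequence_mean", "token_mean", "seq_mean_token_sum_norm"
    max_len: fixed normalizer for "seq_mean_token_sum_norm" (Dr. GRPO)
    """
    if mode == "sequence_mean":
        # GRPO: mean over tokens within each sequence, then mean over sequences
        return sum(sum(s) / len(s) for s in token_losses) / len(token_losses)
    if mode == "token_mean":
        # DAPO: mean over all valid tokens in the batch, pooled across sequences
        total = sum(sum(s) for s in token_losses)
        n_tokens = sum(len(s) for s in token_losses)
        return total / n_tokens
    if mode == "seq_mean_token_sum_norm":
        # Dr. GRPO: per-sequence token sum divided by a fixed max length,
        # then mean over sequences (removes the length bias of sequence_mean)
        return sum(sum(s) / max_len for s in token_losses) / len(token_losses)
    raise ValueError(f"unknown mode: {mode}")
```

Note how the three modes weight long sequences differently: e.g. for `[[1.0, 1.0], [2.0]]`, `sequence_mean` gives 1.5, `token_mean` gives 4/3, and `seq_mean_token_sum_norm` with `max_len=4` gives 0.5.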
cc @erictang000