Tackling Length Inflation Without Trade-offs: Group Relative Reward Rescaling for Reinforcement Learning
About
Reinforcement learning significantly enhances LLM capabilities but suffers from a critical issue: length inflation, where models adopt verbosity or inefficient reasoning to maximize rewards. Prior approaches struggle to address this challenge in a general and lossless manner, primarily because additive penalties introduce a compensatory effect that creates optimization shortcuts, while heuristic gating strategies lack generality beyond binary feedback. To bridge this gap, we present Group Relative Reward Rescaling (GR$^3$), which reframes length control as a multiplicative rescaling paradigm, effectively establishing a generalized, continuous, and reward-dependent gating mechanism. To further ensure lossless optimization, we incorporate group-relative regularization and advantage-aware calibration, which dynamically adapt length budgets to instance difficulty and preserve the advantage signal of high-quality trajectories. Empirically, across both RLHF and RLVR settings, GR$^3$ maintains training dynamics and downstream performance comparable to standard GRPO while significantly mitigating length inflation, outperforming state-of-the-art length-regularized baselines.
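The abstract does not give the exact functional form of the rescaling. Below is a minimal, hypothetical sketch of the general idea: a multiplicative, group-relative, reward-dependent gate on per-trajectory rewards. The function name `gr3_rescale`, the exponential gate, and the `alpha` hyperparameter are all illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def gr3_rescale(rewards, lengths, alpha=0.5):
    """Illustrative sketch (not the paper's implementation) of
    group-relative multiplicative reward rescaling.

    rewards: scalar rewards for one prompt's sampled response group
    lengths: corresponding response lengths in tokens
    alpha:   assumed hyperparameter controlling penalty strength
    """
    rewards = np.asarray(rewards, dtype=float)
    lengths = np.asarray(lengths, dtype=float)

    # Group-relative budget: lengths are judged only against the same
    # prompt's group, so the budget adapts to instance difficulty.
    budget = lengths.mean()
    excess = np.clip((lengths - budget) / budget, 0.0, None)

    # Multiplicative gate in (0, 1]: over-budget responses are scaled
    # down smoothly rather than hit with an additive penalty, avoiding
    # the compensatory shortcut the abstract describes.
    gate = np.exp(-alpha * excess)

    # Reward-dependent application: only shrink positive rewards, so a
    # negative reward is not moved toward zero (which would reward
    # long, low-quality trajectories).
    return np.where(rewards > 0, rewards * gate, rewards)
```

In this sketch, a positive-reward trajectory at or under the group-mean length keeps its full reward, an over-length one is continuously attenuated, and negative rewards pass through untouched.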
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Mathematical Reasoning | MATH500 | Accuracy (Avg@4) | 89.3 | 10 |
| Mathematical Reasoning | AIME 24 | Average Score (Top-32) | 45.2 | 7 |
| Mathematical Reasoning | AIME 25 | Avg@32 Score | 32.8 | 7 |
| Mathematical Reasoning | AMC 23 | Average Accuracy @16 | 93 | 7 |
| Mathematical Reasoning | MATH500 | Avg@4 Score | 94 | 7 |
| Mathematical Reasoning | AMC 23 | Avg@16 Score | 81.6 | 7 |
| Chat Performance | Arena-Hard-Auto | Score | 92.8 | 6 |
| Chat Performance | Alpaca Eval | Score | 55.8 | 6 |
| Code Generation | LiveCodeBench v6 | Score | 41.6 | 6 |