Sharpness-Guided Group Relative Policy Optimization via Probability Shaping
About
Reinforcement learning with verifiable rewards (RLVR) has become a practical route to improve large language model reasoning, and Group Relative Policy Optimization (GRPO) is a widely used optimizer in this setting. However, RLVR training is typically performed with limited control over generalization. We revisit GRPO through a robustness-based generalization view, where the generalization loss is upper bounded by a combination of the empirical loss and a sharpness surrogate measured by the gradient norm. Building on this perspective, we propose Sharpness-Guided GRPO (GRPO-SG), a simple token-weighted variant of GRPO that downweights tokens likely to cause overly large gradients, reducing sharp updates and stabilizing optimization, thereby improving generalization. Experiments across mathematical reasoning, logic puzzles and tool-augmented question answering show consistent improvements over GRPO, along with smoother gradient-norm trajectories, supporting GRPO-SG as a simple and effective generalization-oriented upgrade to GRPO for RLVR.
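The token-weighting idea described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the abstract does not state the exact weighting rule, so the `p ** alpha` rule, the `alpha` parameter, and all function names here are hypothetical, and the usual GRPO importance ratio and clipping are left out for brevity. The sketch relies only on a standard fact about softmax policies: the gradient of `log p(token)` with respect to the logits is `onehot(token) - p`, so low-probability tokens tend to produce larger gradients.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: standardize rewards within one group of samples."""
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def logit_grad_norm(probs_row, token):
    """Norm of d log p(token) / d logits for a softmax policy: ||onehot - p||."""
    g = -np.asarray(probs_row, dtype=float)  # negation makes a fresh array
    g[token] += 1.0
    return float(np.linalg.norm(g))

def token_weights(token_probs, alpha=0.5):
    """Hypothetical shaping rule (an assumption): weight = p ** alpha,
    so low-probability tokens contribute less to the update."""
    return np.asarray(token_probs, dtype=float) ** alpha

def weighted_objective(token_logps, advantage, alpha=0.5):
    """Token-weighted surrogate for one response; weights act as constants."""
    token_logps = np.asarray(token_logps, dtype=float)
    w = token_weights(np.exp(token_logps), alpha)
    return float(np.sum(w * advantage * token_logps) / np.sum(w))

# Binary verifiable rewards for a group of four sampled responses.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# A rare token yields a larger logit-gradient norm than a likely one.
rare = logit_grad_norm([0.9, 0.1], token=1)
likely = logit_grad_norm([0.9, 0.1], token=0)
```

In this sketch, `alpha = 0` recovers uniform token weights, while larger values suppress rare tokens more strongly; in an autodiff framework the weights would need to be detached so that they scale the gradient without being differentiated themselves.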
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Mathematical Reasoning | MATH 500 | Accuracy (Acc) | 79.4 | 543 |
| Mathematical Reasoning | Olympiad Bench | Accuracy | 40.12 | 222 |
| Mathematical Reasoning | Minerva | Accuracy (Acc) | 32.35 | 146 |
| Mathematical Reasoning | Mathematical Reasoning Aggregate | Average Score | 43.04 | 37 |
| Question Answering | NQ, TriviaQA, PopQA, HotpotQA, 2wiki, MuSiQue, Bamboogle | NQ Score | 48.23 | 22 |
| Logic Puzzles | K&K Logic Puzzles | Accuracy (Level 3) | 95 | 15 |
| Logical reasoning | K&K Logic Puzzles | Accuracy (Level 3) | 95 | 12 |
| Mathematical Reasoning | AIME | Average Score (@16) | 18.13 | 11 |