Sharpness-Guided Group Relative Policy Optimization via Probability Shaping

About

Reinforcement learning with verifiable rewards (RLVR) has become a practical route to improve large language model reasoning, and Group Relative Policy Optimization (GRPO) is a widely used optimizer in this setting. However, RLVR training is typically performed with limited control over generalization. We revisit GRPO through a robustness-based generalization view, where the generalization loss is upper bounded by a combination of the empirical loss and a sharpness surrogate measured by the gradient norm. Building on this perspective, we propose Sharpness-Guided GRPO (GRPO-SG), a simple token-weighted variant of GRPO that downweights tokens likely to cause overly large gradients, reducing sharp updates and stabilizing optimization, thereby improving generalization. Experiments across mathematical reasoning, logic puzzles and tool-augmented question answering show consistent improvements over GRPO, along with smoother gradient-norm trajectories, supporting GRPO-SG as a simple and effective generalization-oriented upgrade to GRPO for RLVR.
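To make the token-weighting idea concrete, here is a minimal sketch in plain Python. The group-relative advantage is the usual GRPO normalization of rewards within a group. The weighting rule is not taken from the paper, whose exact formula is not given in this summary. It is a hypothetical stand-in that relies on one fact about softmax policies: the gradient of log p with respect to the logits has norm at least 1 - p, so low-probability tokens tend to produce larger gradients. The names `token_weights`, `weighted_surrogate`, and the threshold `tau` are assumptions made for illustration.

```python
import math
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """GRPO-style advantage: standardize each reward within its sampled group."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    return [(r - mean) / (math.sqrt(var) + eps) for r in rewards]


def token_weights(token_probs: List[float], tau: float = 0.5) -> List[float]:
    """Hypothetical weighting (not the paper's formula).

    Uses 1 - p as a proxy for the per-token gradient size and scales down
    any token whose proxy exceeds tau, so its weighted proxy equals tau.
    """
    weights = []
    for p in token_probs:
        grad_proxy = 1.0 - p
        weights.append(1.0 if grad_proxy <= tau else tau / grad_proxy)
    return weights


def weighted_surrogate(token_probs: List[float], advantage: float, tau: float = 0.5) -> float:
    """Token-weighted policy-gradient surrogate for one response.

    Weights are treated as constants; probabilities must be strictly positive.
    """
    weights = token_weights(token_probs, tau)
    terms = [w * advantage * math.log(p) for w, p in zip(weights, token_probs)]
    return -sum(terms) / len(terms)


if __name__ == "__main__":
    rewards = [1.0, 0.0, 1.0, 0.0]          # verifiable 0/1 rewards for one group
    advantages = group_relative_advantages(rewards)
    probs = [0.9, 0.1]                      # policy probabilities of two sampled tokens
    print(advantages)
    print(token_weights(probs))
    print(weighted_surrogate(probs, advantages[0]))
```

In this sketch, setting `tau=1.0` leaves every weight at 1 and recovers an unweighted surrogate, which makes it easy to compare the two on the same inputs. A real implementation would also include GRPO's clipped importance ratio and any KL term, which are omitted here.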

Tue Le, Linh Ngo Van, Trung Le • 2025

Related benchmarks

Task	Dataset	Metric	Result
Mathematical Reasoning	MATH 500	Accuracy (Acc)	79.4	600
Mathematical Reasoning	Olympiad Bench	Accuracy	40.12	254
Mathematical Reasoning	Minerva	Accuracy (Acc)	32.35	146
Question Answering	NQ, TriviaQA, PopQA, HotpotQA, 2wiki, MuSiQue, Bamboogle	Average QA Score	40.18	55
Mathematical Reasoning	Mathematical Reasoning Aggregate	Average Score	43.04	46
Logic Puzzles	K&K Logic Puzzles	Accuracy (Level 3)	95	15
Logical Reasoning	K&K Logic Puzzles	Accuracy (Level 3)	95	12
Mathematical Reasoning	AIME	Average Score (@16)	18.13	11

