Sharpness-Guided Group Relative Policy Optimization via Probability Shaping

About

Reinforcement learning with verifiable rewards (RLVR) has become a practical route to improve large language model reasoning, and Group Relative Policy Optimization (GRPO) is a widely used optimizer in this setting. However, RLVR training is typically performed with limited control over generalization. We revisit GRPO through a robustness-based generalization view, where the generalization loss is upper bounded by a combination of the empirical loss and a sharpness surrogate measured by the gradient norm. Building on this perspective, we propose Sharpness-Guided GRPO (GRPO-SG), a simple token-weighted variant of GRPO that downweights tokens likely to cause overly large gradients, reducing sharp updates and stabilizing optimization, thereby improving generalization. Experiments across mathematical reasoning, logic puzzles and tool-augmented question answering show consistent improvements over GRPO, along with smoother gradient-norm trajectories, supporting GRPO-SG as a simple and effective generalization-oriented upgrade to GRPO for RLVR.
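The token-weighting idea in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: the abstract does not give the GRPO-SG weighting formula, so `token_weight` and its `floor` parameter are hypothetical stand-ins, and the objective omits the importance ratio, clipping and KL term of full GRPO. The sketch relies on only two known facts: GRPO standardizes rewards within a group of sampled responses, and for a softmax output the gradient of a token's log-probability with respect to the logits is larger in norm when that token has low probability.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of sampled responses (GRPO-style)."""
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def logit_grad_norm(probs, token):
    """L2 norm of d log p[token] / d logits for a softmax distribution.

    The gradient is one_hot(token) - probs.
    """
    grad = -np.asarray(probs, dtype=float)  # negation returns a new array
    grad[token] += 1.0
    return np.linalg.norm(grad)

def token_weight(p_token, floor=0.1):
    """Hypothetical shaping rule (an assumption, not the paper's formula):
    the weight equals the token probability, clipped from below at `floor`,
    so low-probability tokens contribute less to the update."""
    return np.clip(p_token, floor, 1.0)

def weighted_token_objective(token_logps, advantage, weights):
    """Simplified per-response surrogate: advantage times the weighted mean
    of token log-probabilities (no ratio, clipping, or KL penalty)."""
    token_logps = np.asarray(token_logps, dtype=float)
    weights = np.asarray(weights, dtype=float)
    return advantage * np.sum(weights * token_logps) / np.sum(weights)

# Example: four sampled responses with binary verifiable rewards.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# Low-probability tokens produce larger logit gradients than likely ones.
probs = [0.9, 0.1]
norm_likely = logit_grad_norm(probs, token=0)
norm_unlikely = logit_grad_norm(probs, token=1)

# Weighted objective for the first response, which has two tokens.
token_logps = np.array([-0.1, -2.3])
weights = token_weight(np.exp(token_logps))
objective = weighted_token_objective(token_logps, advantages[0], weights)
```

In this toy setup the unlikely token has the larger gradient norm and receives the smaller weight, which is the qualitative behavior the abstract describes. Any real use would need the weighting rule from the paper itself.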

Tue Le, Linh Ngo Van, Trung Le • 2025

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Mathematical Reasoning | MATH 500 | Accuracy (Acc) | 79.4 | 543 |
| Mathematical Reasoning | Olympiad Bench | Accuracy | 40.12 | 222 |
| Mathematical Reasoning | Minerva | Accuracy (Acc) | 32.35 | 146 |
| Mathematical Reasoning | Mathematical Reasoning Aggregate | Average Score | 43.04 | 37 |
| Question Answering | NQ, TriviaQA, PopQA, HotpotQA, 2wiki, MuSiQue, Bamboogle | NQ Score | 48.23 | 22 |
| Logic Puzzles | K&K Logic Puzzles | Accuracy (Level 3) | 95 | 15 |
| Logical reasoning | K&K Logic Puzzles | Accuracy (Level 3) | 95 | 12 |
| Mathematical Reasoning | AIME | Average Score (@16) | 18.13 | 11 |