Outcome-Grounded Advantage Reshaping for Fine-Grained Credit Assignment in Mathematical Reasoning

About

Group Relative Policy Optimization (GRPO) has emerged as a promising critic-free reinforcement learning paradigm for reasoning tasks. However, standard GRPO employs a coarse-grained credit assignment mechanism that propagates group-level rewards uniformly to every token in a sequence, neglecting the varying contribution of individual reasoning steps. We address this limitation by introducing Outcome-grounded Advantage Reshaping (OAR), a fine-grained credit assignment mechanism that redistributes advantages based on how much each token influences the model's final answer. We instantiate OAR via two complementary strategies: (1) OAR-P, which estimates outcome sensitivity through counterfactual token perturbations, serving as a high-fidelity attribution signal; (2) OAR-G, which uses an input-gradient sensitivity proxy to approximate the influence signal with a single backward pass. These importance signals are integrated with a conservative Bi-Level advantage reshaping scheme that suppresses low-impact tokens and boosts pivotal ones while preserving the overall advantage mass. Empirical results on extensive mathematical reasoning benchmarks show that OAR-P sets the performance upper bound, while OAR-G achieves comparable gains with negligible computational overhead. Both significantly outperform a strong GRPO baseline, pushing the boundaries of critic-free LLM reasoning.
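The reshaping idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the exact rule, so the mean-importance threshold, the `low`/`high` weights, and the function names `group_relative_advantages` and `bilevel_reshape` are all assumptions chosen for illustration. The importance scores are supplied by hand here; in the paper they would come from OAR-P or OAR-G. The only property the sketch tries to show is that low-impact tokens are damped and pivotal ones boosted while the summed advantage over the sequence stays the same as under uniform assignment.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style sequence-level advantages: standardize rewards within one group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def bilevel_reshape(advantage, importance, low=0.5, high=1.5):
    """Spread one sequence-level advantage over tokens using importance scores.

    Hypothetical rule (an assumption, not taken from the paper): tokens whose
    importance is at or above the sequence mean get weight `high`, the rest
    get weight `low`. The weights are then rescaled to sum to the token count,
    so the total advantage matches uniform per-token assignment.
    """
    n = len(importance)
    threshold = mean(importance)
    weights = [high if s >= threshold else low for s in importance]
    scale = n / sum(weights)  # renormalize so the advantage mass is preserved
    return [advantage * w * scale for w in weights]


# Toy group of four sampled answers with binary correctness rewards.
rewards = [1.0, 0.0, 0.0, 1.0]
advs = group_relative_advantages(rewards)

# Made-up per-token importance scores for the first sampled answer.
importance = [0.05, 0.10, 0.70, 0.15]
token_advs = bilevel_reshape(advs[0], importance)
```

In this toy run the third token, which carries most of the assumed importance, receives a larger share of the advantage than the others, and the per-token values still sum to the sequence-level advantage times the number of tokens.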

Ziheng Li, Liu Kang, Feng Xiao, Luxi Xing, Qingyi Si, Zhuoran Li, Weikang Gong, Deqing Yang, Yanghua Xiao, Hongcheng Guo • 2026

Related benchmarks

Task	Dataset	Result
Mathematical Reasoning	GSM8K	--	499
Role-playing	CharacterBench	Overall Average Score 3.882	70
Role-play dialogue comprehension	SocialBench	Role Knowledge 93.2	61
Role-playing	CharacterBench latest (full)	Overall Score 4.45	47
Mathematical Reasoning	AMC 23	Pass@1 65.6	46
Mathematical Reasoning	AIME 2024	P@1 33.5	13
Mathematical Reasoning	MATH 500	P@1 83.8	13
Mathematical Reasoning	AIME 2025	P@1 14.5	13
Mathematical Reasoning	Mathematical Reasoning Benchmarks AIME24, AMC23, MATH500	AIME24 Score 15	6
