Where Hindsight Credit Can Reside: A Signed-Capacity View of Token Updates in RLVR

About

Reinforcement Learning with Verifiable Rewards (RLVR) improves the reasoning ability of Large Language Models (LLMs), but sparse outcome rewards make token-level credit assignment difficult. We study token-level credit as a reward-conditioned shift from the behavior policy to a hindsight posterior. In autoregressive RLVR, this shift can be expressed through Conditional Mutual Information (CMI), which shows that token entropy upper-bounds possible hindsight credit. Entropy, however, indicates capacity rather than update direction, so we introduce the Four Quadrant Decomposition to separate updates by reward polarity and token entropy. Controlled interventions show that these two factors jointly shape token updates. Sustained reasoning gains concentrate in signed high-entropy quadrants, whereas low-entropy updates saturate quickly. Based on this analysis, we propose Hindsight-Aware Policy Optimization (HAPO), a sign-preserving modification to GRPO that performs capacity-guided advantage reallocation. Experiments on mathematical reasoning benchmarks in two model settings show that HAPO achieves competitive performance among entropy-aware baselines.
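The Four Quadrant Decomposition described above groups token updates by reward polarity and token entropy. The following is a minimal sketch of that grouping, not the authors' implementation: the function names, the toy distributions, and especially the entropy cutoff `threshold` are assumptions made for illustration, since the abstract does not say how high and low entropy are separated.

```python
import math
from typing import List


def token_entropy(probs: List[float]) -> float:
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def quadrant(reward_sign: int, entropy: float, threshold: float) -> str:
    """Label a token by reward polarity and entropy level.

    `threshold` is a hypothetical cutoff chosen for this sketch; the
    source does not specify how the two entropy regimes are split.
    """
    polarity = "positive" if reward_sign > 0 else "negative"
    level = "high-entropy" if entropy >= threshold else "low-entropy"
    return f"{polarity}/{level}"


# Toy next-token distributions for a three-token response (illustrative values).
token_dists = [
    [0.97, 0.01, 0.01, 0.01],  # near-deterministic token
    [0.25, 0.25, 0.25, 0.25],  # maximally uncertain token
    [0.70, 0.10, 0.10, 0.10],  # moderately peaked token
]
reward_sign = +1   # assumed: the verifier accepted this response
threshold = 1.0    # assumed cutoff, in nats

entropies = [token_entropy(d) for d in token_dists]
labels = [quadrant(reward_sign, h, threshold) for h in entropies]
for h, label in zip(entropies, labels):
    print(f"entropy={h:.3f} nats -> {label}")
```

In this reading, entropy only says how much room a token's distribution has to shift toward the hindsight posterior, while the reward sign says in which direction an update would push, which is why the two are kept as separate axes.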

Yuhang He, Haodong Wu, Siyi Liu, Hongyu Ge, Hange Zhou, Keyi Wu, Zhuo Zheng, Qihong Lin, Zixin Zhong, Yongqi Zhang • 2026

Related benchmarks

Task	Dataset	Result
Mathematical Reasoning	MATH 500	Accuracy 91.1	589
Mathematical Reasoning	AIME 2024	Accuracy 45.1	394
Mathematical Reasoning	AIME 2025	Accuracy 30.1	378
Mathematical Reasoning	OlympiadBench	Accuracy 60.8	134
Mathematical Reasoning	AMC	Average Pass@32 75.5	44
Mathematical Reasoning	MATH	Pass@k (PK) 88.7	31
Reasoning	In-distribution reasoning (test)	GPQA Score 28.12	30
Graph Reasoning	OOD graph (test)	GraphWiz Score 35.81	30
Mathematical Reasoning	Minerva Math	Avg@4 43.9	19
Mathematical Reasoning	AIME 2025	Average Score (Avg@32) 17.2	10

Showing 10 of 17 rows
