Where Hindsight Credit Can Reside: A Signed-Capacity View of Token Updates in RLVR

About

Reinforcement Learning with Verifiable Rewards (RLVR) improves the reasoning ability of Large Language Models (LLMs), but sparse outcome rewards make token-level credit assignment difficult. We study token-level credit as a reward-conditioned shift from the behavior policy to a hindsight posterior. In autoregressive RLVR, this shift can be expressed through Conditional Mutual Information (CMI), which shows that token entropy upper-bounds possible hindsight credit. Entropy, however, indicates capacity rather than update direction, so we introduce the Four Quadrant Decomposition to separate updates by reward polarity and token entropy. Controlled interventions show that these two factors jointly shape token updates. Sustained reasoning gains concentrate in signed high-entropy quadrants, whereas low-entropy updates saturate quickly. Based on this analysis, we propose Hindsight-Aware Policy Optimization (HAPO), a sign-preserving modification to GRPO that performs capacity-guided advantage reallocation. Experiments on mathematical reasoning benchmarks in two model settings show that HAPO achieves competitive performance among entropy-aware baselines.

Yuhang He, Haodong Wu, Siyi Liu, Hongyu Ge, Hange Zhou, Keyi Wu, Zhuo Zheng, Qihong Lin, Zixin Zhong, Yongqi Zhang • 2026
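
The Four Quadrant Decomposition mentioned in the abstract separates token updates by reward polarity and token entropy. The following is a minimal illustrative sketch of that kind of partition, not the authors' implementation: the function names, the fixed entropy `threshold`, and the use of the sign of a sequence-level advantage as the reward polarity are all assumptions made for illustration, since this page does not specify them.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of a single next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def quadrant(entropy, advantage, threshold):
    """Label one token by reward polarity and entropy level.

    Assumptions (hypothetical, not taken from the paper):
    - polarity is the sign of the sequence-level advantage;
    - a token counts as high-entropy when entropy >= threshold;
    - a zero advantage is grouped with "neg" purely for simplicity.
    """
    polarity = "pos" if advantage > 0 else "neg"
    level = "high" if entropy >= threshold else "low"
    return f"{polarity}/{level}"


def decompose(token_dists, advantage, threshold):
    """Assign every token of one sampled response to a quadrant."""
    return [quadrant(token_entropy(p), advantage, threshold) for p in token_dists]


# Toy response with two tokens: one uncertain, one nearly deterministic.
dists = [[0.5, 0.5], [0.99, 0.01]]
labels_correct = decompose(dists, advantage=1.0, threshold=0.3)
labels_wrong = decompose(dists, advantage=-1.0, threshold=0.3)
```

With these toy inputs, the first token (entropy ln 2 ≈ 0.69 nats) falls in a high-entropy quadrant and the second (entropy ≈ 0.06 nats) in a low-entropy one, while the sign of the advantage selects the positive or negative side. In practice the threshold would more plausibly be chosen per batch (for example, as an entropy quantile), which is again an assumption rather than something stated on this page.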

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Mathematical Reasoning | AMC | Average Pass@32 | 75.5 | 44 |
| Mathematical Reasoning | MATH | Pass@k (PK) | 88.7 | 31 |
| Reasoning | In-distribution reasoning (test) | GPQA Score | 28.12 | 30 |
| Graph Reasoning | OOD graph (test) | GraphWiz Score | 35.81 | 30 |
| Mathematical Reasoning | Minerva Math | Avg@4 | 43.9 | 19 |
| Mathematical Reasoning | AIME 2025 | Average Score (Avg@32) | 17.2 | 10 |
| Mathematical Reasoning | Aggregate (AIME, AMC, MATH, Minerva, Olympiad) | Average Score | 48.6 | 10 |
| Mathematical Reasoning | OlympiadBench | Avg@4 Score | 45.1 | 10 |
| Mathematical Reasoning | AMC | Avg@32 | 62.1 | 10 |
| Mathematical Reasoning | AIME 2024 | Average Score (Avg@32) | 45.1 | 9 |
Showing 10 of 13 rows
