Where Hindsight Credit Can Reside: A Signed-Capacity View of Token Updates in RLVR
About
Reinforcement Learning with Verifiable Rewards (RLVR) improves the reasoning ability of Large Language Models (LLMs), but sparse outcome rewards make token-level credit assignment difficult. We study token-level credit as a reward-conditioned shift from the behavior policy to a hindsight posterior. In autoregressive RLVR, this shift can be expressed through Conditional Mutual Information (CMI), which shows that token entropy upper-bounds possible hindsight credit. Entropy, however, indicates capacity rather than update direction, so we introduce the Four Quadrant Decomposition to separate updates by reward polarity and token entropy. Controlled interventions show that these two factors jointly shape token updates. Sustained reasoning gains concentrate in signed high-entropy quadrants, whereas low-entropy updates saturate quickly. Based on this analysis, we propose Hindsight-Aware Policy Optimization (HAPO), a sign-preserving modification to GRPO that performs capacity-guided advantage reallocation. Experiments on mathematical reasoning benchmarks in two model settings show that HAPO achieves competitive performance among entropy-aware baselines.
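As a rough illustration of the Four Quadrant Decomposition described above, the sketch below labels each token of a rollout by reward polarity and token entropy. It is a minimal sketch, not the authors' implementation. The function names, the use of a sequence-level advantage sign as the reward polarity, and the fixed entropy `threshold` are all assumptions made for illustration; the abstract does not specify how the high/low entropy split is chosen.

```python
import math
from typing import List, Sequence


def token_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def quadrant(advantage: float, entropy: float, threshold: float) -> str:
    """Label a token by reward polarity and entropy level.

    Assumption: polarity is the sign of the rollout's advantage, and
    `threshold` is a hypothetical cut-off between low and high entropy.
    """
    if advantage == 0.0:
        # A zero advantage carries no signed update, so it has no quadrant.
        return "no-update"
    polarity = "positive" if advantage > 0.0 else "negative"
    level = "high" if entropy >= threshold else "low"
    return f"{polarity}/{level}-entropy"


def decompose(
    token_dists: Sequence[Sequence[float]],
    advantage: float,
    threshold: float,
) -> List[str]:
    """Assign every token in one rollout to a quadrant."""
    return [
        quadrant(advantage, token_entropy(dist), threshold)
        for dist in token_dists
    ]


# Toy rollout: one near-deterministic token and one uniform token.
dists = [[0.97, 0.01, 0.01, 0.01], [0.25, 0.25, 0.25, 0.25]]
print(decompose(dists, advantage=1.0, threshold=1.0))
print(decompose(dists, advantage=-0.5, threshold=1.0))
```

In this toy rollout the near-deterministic token falls in a low-entropy quadrant and the uniform token (entropy ln 4) in a high-entropy one, and the advantage sign alone decides whether the pair is positive or negative.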
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Mathematical Reasoning | AMC | Average Pass@32 | 75.5 | 44 |
| Mathematical Reasoning | MATH | Pass@k (PK) | 88.7 | 31 |
| Reasoning | In-distribution reasoning (test) | GPQA Score | 28.12 | 30 |
| Graph Reasoning | OOD graph (test) | GraphWiz Score | 35.81 | 30 |
| Mathematical Reasoning | Minerva Math | Avg@4 | 43.9 | 19 |
| Mathematical Reasoning | AIME 2025 | Average Score (Avg@32) | 17.2 | 10 |
| Mathematical Reasoning | Aggregate (AIME, AMC, MATH, Minerva, Olympiad) | Average Score | 48.6 | 10 |
| Mathematical Reasoning | OlympiadBench | Avg@4 Score | 45.1 | 10 |
| Mathematical Reasoning | AMC | Avg@32 | 62.1 | 10 |
| Mathematical Reasoning | AIME 2024 | Average Score (Avg@32) | 45.1 | 9 |