
Unleashing Implicit Rewards: Prefix-Value Learning for Distribution-Level Optimization

About

Process reward models (PRMs) provide fine-grained reward signals along the reasoning process, but training reliable PRMs often requires step annotations or heavy verification pipelines, making them expensive to scale and refresh during online RL. Implicit PRMs mitigate this cost by learning decomposable token- or step-level rewards from trajectory-level outcome labels. However, they suffer from a train-inference mismatch: training only constrains a sequence-level aggregate, whereas inference requires token-level scores to reflect local step quality. As a result, token-level credits are weakly identified and may fail to faithfully reflect which reasoning steps are actually correct. This unreliability undermines a key promise of implicit PRMs: scoring many candidate tokens. In practice, noisy per-token advantages may systematically reinforce incorrect continuations. We address this problem with a novel Implicit Prefix-Value Reward Model (IPVRM), which directly learns a prefix-conditioned value function estimating the probability of eventual correctness, and derives step signals via temporal-difference (TD) differences. IPVRM substantially improves step-verification F1 on ProcessBench. Building on these calibrated prefix values, we further propose Distribution-Level RL (DistRL), which computes TD advantages for both sampled tokens and high-probability candidate tokens, enabling dense counterfactual updates without additional rollouts. While DistRL offers limited gains when powered by miscalibrated implicit rewards, it consistently improves downstream reasoning once paired with IPVRM.
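The core mechanism described above — deriving step-level credit as temporal-difference differences of a prefix-conditioned value function — can be sketched as follows. This is a minimal illustration, not the paper's implementation; the function name `step_rewards` and the toy values are assumptions for exposition only.

```python
# Minimal sketch of TD-style step rewards from prefix values (illustrative,
# not the paper's code). V(prefix_t) estimates the probability that the
# trajectory, continued from this prefix, will eventually be correct.

def step_rewards(prefix_values):
    """Given [V(prefix_0), ..., V(prefix_T)], credit step t with the
    TD difference r_t = V(prefix_t) - V(prefix_{t-1}): the change in
    estimated success probability caused by that step."""
    return [v1 - v0 for v0, v1 in zip(prefix_values, prefix_values[1:])]

# Toy trajectory: the third step lowers the success estimate,
# so it receives a negative step reward and is flagged as a likely error.
values = [0.50, 0.62, 0.70, 0.40, 0.45]
rewards = step_rewards(values)
```

Under this view, a step with a negative TD reward is one that reduced the model's estimated chance of reaching a correct final answer, which is what enables step-level error localization without per-step annotations.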

Shiping Gao, Hongzhan Chen, Xiaojun Quan, Qifan Wang, Lifu Huang • 2026

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Mathematical Reasoning | AMC | Accuracy | 37.3 | 203 |
| Mathematical Reasoning | Minerva Math | Accuracy | 38.4 | 186 |
| Mathematical Reasoning | MATH 500 | Accuracy | 70.9 | 149 |
| Mathematical Reasoning | Olympiad Bench | Accuracy | 34.1 | 123 |
| Mathematical Reasoning | AIME 2024 | Accuracy | 8.2 | 104 |
| Mathematical Reasoning | AMC'23 (test) | Accuracy | 50 | 60 |
| Mathematical Reasoning | AMC23 (test) | Pass@1 | 10 | 56 |
| Mathematical Reasoning | MATH 500 | Pass@4 | 66.8 | 20 |
| Mathematical Reasoning | Minerva Math | Accuracy@4 | 25.4 | 20 |
| Process-level Error Localization | ProcessBench | GSM8K Accuracy | 57.2 | 20 |
