
Unleashing Implicit Rewards: Prefix-Value Learning for Distribution-Level Optimization

About

Process reward models (PRMs) provide fine-grained supervision for reasoning, but reliable PRMs often require step annotations or heavy verification pipelines, making them costly to scale and refresh during online RL. Implicit PRMs reduce this cost by training log-likelihood-ratio rewards from trajectory-level outcome labels. However, the log-ratio is constrained only as a sequence-level aggregate during training, while inference decomposes it into token- or step-level scores for partial prefixes. This train-inference mismatch leaves local credits weakly identified, so distribution-wide scoring can amplify misleading advantages. We propose the Implicit Prefix-Value Reward Model (IPVRM), which directly learns the probability of eventual correctness for each prefix from outcome labels. Step signals are then obtained as temporal differences (TD) between consecutive prefix values, aligning the training target with inference-time use. IPVRM markedly improves step-verification F1 on ProcessBench. To exploit these prefix values during policy optimization, we further introduce Distribution-Level RL (DistRL), which applies TD advantages to both sampled tokens and high-probability candidate tokens, providing dense counterfactual updates without additional rollouts. Experiments show that DistRL brings limited gains with unreliable implicit rewards, but consistently improves downstream reasoning when paired with IPVRM. The implementation of our method is available at https://github.com/gaoshiping/IPVRM.
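The step-signal idea in the abstract can be illustrated with a small sketch. It assumes prefix values are already available as a list of probabilities of eventual correctness, one per prefix, and computes the difference between each pair of consecutive values. The function name `td_step_signals`, the example values, and the "largest drop" heuristic are hypothetical and chosen for illustration; they are not taken from the authors' repository and may differ from the exact formulation in the paper.

```python
from typing import List


def td_step_signals(prefix_values: List[float]) -> List[float]:
    """Return V[t+1] - V[t] for each pair of consecutive prefix values.

    prefix_values[t] is assumed to be an estimate of the probability that
    the solution ends up correct given the first t steps (hypothetical input).
    """
    if len(prefix_values) < 2:
        return []
    return [nxt - cur for cur, nxt in zip(prefix_values, prefix_values[1:])]


# Hypothetical prefix values for a 4-step solution: the value before any
# step, then the value after each of the four steps.
values = [0.60, 0.70, 0.30, 0.25, 0.05]
signals = td_step_signals(values)

# Illustrative heuristic (an assumption, not the paper's decision rule):
# the step with the most negative signal is the likeliest point of error.
largest_drop_step = min(range(len(signals)), key=lambda i: signals[i])

for i, s in enumerate(signals):
    print(f"step {i}: {s:+.2f}")
print("largest drop at step", largest_drop_step)
```

Because the differences telescope, the signals sum to the change from the first prefix value to the last, so each step's credit is tied directly to the quantity the model is trained to predict.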

Shiping Gao, Hongzhan Chen, Xiaojun Quan, Qifan Wang, Lifu Huang • 2026

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Mathematical Reasoning | MATH 500 | Accuracy (Acc) | 70.9 | 543 |
| Mathematical Reasoning | AIME 2024 | Accuracy | 8.2 | 479 |
| Mathematical Reasoning | Minerva Math | Accuracy | 38.4 | 233 |
| Mathematical Reasoning | Olympiad Bench | Accuracy | 34.1 | 222 |
| Mathematical Reasoning | AMC | Accuracy (ACC) | 37.3 | 215 |
| Mathematical Reasoning | AMC'23 (test) | Accuracy | 50 | 152 |
| Mathematical Reasoning | AMC23 (test) | Pass@1 | 10 | 61 |
| Process-level Error Localization | PROCESSBENCH | GSM8K Accuracy | 57.2 | 44 |
| Mathematical Reasoning | MATH 500 | Pass@4 | 66.8 | 20 |
| Mathematical Reasoning | Minerva Math | Accuracy @4 | 25.4 | 20 |
