
Unleashing Implicit Rewards: Prefix-Value Learning for Distribution-Level Optimization

About

Process reward models (PRMs) provide fine-grained reward signals along the reasoning process, but training reliable PRMs often requires step annotations or heavy verification pipelines, making them expensive to scale and refresh during online RL. Implicit PRMs mitigate this cost by learning decomposable token- or step-level rewards from trajectory-level outcome labels. However, they suffer from a train-inference mismatch: training only constrains a sequence-level aggregate, whereas inference requires token-level scores to reflect local step quality. As a result, token-level credits are weakly identified and may fail to faithfully reflect which reasoning steps are actually correct. This unreliability undermines a key promise of implicit PRMs: scoring many candidate tokens. In practice, noisy per-token advantages may systematically reinforce incorrect continuations. We address this problem with a novel Implicit Prefix-Value Reward Model (IPVRM), which directly learns a prefix-conditioned value function estimating the probability of eventual correctness, and derives step signals via temporal-difference (TD) differences. IPVRM substantially improves step-verification F1 on ProcessBench. Building on these calibrated prefix values, we further propose Distribution-Level RL (DistRL), which computes TD advantages for both sampled tokens and high-probability candidate tokens, enabling dense counterfactual updates without additional rollouts. While DistRL offers limited gains when powered by miscalibrated implicit rewards, it consistently improves downstream reasoning once paired with IPVRM.
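The core mechanism described above — deriving step-level credit as temporal-difference differences of a prefix-conditioned value function — can be sketched as follows. This is a minimal illustration, not the paper's implementation; the function name `step_rewards` and the toy values are assumptions for exposition only.

```python
# Minimal sketch of TD-style step rewards from prefix values (illustrative,
# not the paper's code). V(prefix_t) estimates the probability that the
# trajectory, continued from this prefix, will eventually be correct.

def step_rewards(prefix_values):
    """Given [V(prefix_0), ..., V(prefix_T)], credit step t with the
    TD difference r_t = V(prefix_t) - V(prefix_{t-1}): the change in
    estimated success probability caused by that step."""
    return [v1 - v0 for v0, v1 in zip(prefix_values, prefix_values[1:])]

# Toy trajectory: the third step lowers the success estimate,
# so it receives a negative step reward and is flagged as a likely error.
values = [0.50, 0.62, 0.70, 0.40, 0.45]
rewards = step_rewards(values)
```

Under this view, a step with a negative TD reward is one that reduced the model's estimated chance of reaching a correct final answer, which is what enables step-level error localization without per-step annotations.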

Shiping Gao, Hongzhan Chen, Xiaojun Quan, Qifan Wang, Lifu Huang • 2026

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Mathematical Reasoning | AMC | Accuracy | 37.3 | 203 |
| Mathematical Reasoning | Minerva Math | Accuracy | 38.4 | 186 |
| Mathematical Reasoning | MATH 500 | Accuracy | 70.9 | 149 |
| Mathematical Reasoning | Olympiad Bench | Accuracy | 34.1 | 123 |
| Mathematical Reasoning | AIME 2024 | Accuracy | 8.2 | 104 |
| Mathematical Reasoning | AMC'23 (test) | Accuracy | 50 | 60 |
| Mathematical Reasoning | AMC23 (test) | Pass@1 | 10 | 56 |
| Mathematical Reasoning | MATH 500 | Pass@4 | 66.8 | 20 |
| Mathematical Reasoning | Minerva Math | Accuracy@4 | 25.4 | 20 |
| Process-level Error Localization | ProcessBench | GSM8K Accuracy | 57.2 | 20 |
