
Reward Models Are Secretly Value Functions: Temporally Coherent Reward Modeling

About

Reward models in RLHF are trained to score only the final token of a response - a choice that discards rich signal from every intermediate position and produces models whose token-level outputs are noise. We argue this is a missed opportunity: a well-trained reward model's output at any token should represent the conditional expectation of the final reward given the response so far. We introduce Temporally Coherent Reward Modeling (TCRM), which induces this property via two regularization terms on top of the standard Bradley-Terry loss, with minimizers provably equal to conditional expectations. The regularizers correspond to Monte Carlo and TD value-learning objectives, establishing a direct connection to RL value functions. TCRM requires zero changes to architecture, data, or inference, yet unlocks three capabilities from one principle: interpretable token-level reward trajectories (middle-token pairwise accuracy improved from 50% to 88.9%, final-token accuracy preserved); state-of-the-art PRM performance on ProcessBench (44.9% average F1) among models trained only on outcome data; and unified reward/value modeling in PPO, reducing peak GPU memory by 27% and step time by 19% with matching LLM quality.
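The abstract does not give the exact form of the two regularizers, so the following is only a minimal sketch of one plausible reading, not the authors' implementation. It assumes the reward model emits one scalar per token, that the Monte Carlo-style term regresses each intermediate output toward the final-token output, and that the TD-style term regresses each output toward the next token's output. The function names and the weights `lam_mc` and `lam_td` are hypothetical.

```python
import numpy as np

def bradley_terry_loss(r_chosen_final, r_rejected_final):
    # Standard Bradley-Terry loss: -log sigmoid(r_chosen - r_rejected),
    # written with logaddexp for numerical stability.
    return np.logaddexp(0.0, -(r_chosen_final - r_rejected_final))

def mc_regularizer(token_rewards):
    # Monte Carlo-style term (assumed form): squared error between every
    # intermediate output and the final-token output. In a real training
    # loop the target would be treated as a constant (stop-gradient).
    target = token_rewards[-1]
    return np.mean((token_rewards[:-1] - target) ** 2)

def td_regularizer(token_rewards):
    # TD(0)-style term (assumed form): squared error between each output
    # and the output at the next token, again with a fixed target.
    return np.mean((token_rewards[:-1] - token_rewards[1:]) ** 2)

def tcrm_style_loss(chosen, rejected, lam_mc=0.1, lam_td=0.1):
    # chosen / rejected: 1-D arrays of per-token outputs, length >= 2.
    # lam_mc and lam_td are hypothetical weights, not values from the paper.
    bt = bradley_terry_loss(chosen[-1], rejected[-1])
    mc = mc_regularizer(chosen) + mc_regularizer(rejected)
    td = td_regularizer(chosen) + td_regularizer(rejected)
    return bt + lam_mc * mc + lam_td * td

# Toy per-token trajectories for one preference pair.
chosen = np.array([0.0, 0.5, 1.0])
rejected = np.array([0.0, -0.5, -1.0])
loss = tcrm_style_loss(chosen, rejected)
```

Under these assumptions both regularizers are squared-error regressions, whose minimizer is the conditional expectation of the target. That is the property the abstract describes, though the paper's actual objectives may differ in detail.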

Alex Nikulkov • 2026

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Reward Modeling | RewardBench v2 (test) | Average Score: 74.4 | 67 |
| Reward Modeling | Skywork-Reward-Preference (test) | Final Token Accuracy: 93.6 | 27 |
| Process Reward Model Assessment | PROCESSBENCH | GSM8K Accuracy: 68.9 | 20 |
| Process Reward Modeling | ProcessBench 1.0 (test) | GSM8K Score: 68.9 | 14 |
| Preference Evaluation | LLM-as-a-Judge comparison set | TCRM Better Rate: 33.7 | 1 |
