Process Reward Models That Think

About

Step-by-step verifiers, also known as process reward models (PRMs), are a key ingredient for test-time scaling. PRMs require step-level supervision, making them expensive to train. This work aims to build data-efficient PRMs as verbalized step-wise reward models that verify every step in the solution by generating a verification chain-of-thought (CoT). We propose ThinkPRM, a long CoT verifier fine-tuned on orders of magnitude fewer process labels than those required by discriminative PRMs. Our approach capitalizes on the inherent reasoning abilities of long CoT models and, using only 1% of the process labels in PRM800K, outperforms LLM-as-a-Judge and discriminative verifiers across several challenging benchmarks. Specifically, ThinkPRM beats the baselines on ProcessBench, MATH-500, and AIME '24 under best-of-N selection and reward-guided search. In an out-of-domain evaluation on a subset of GPQA-Diamond and LiveCodeBench, our PRM surpasses discriminative verifiers trained on the full PRM800K by 8% and 4.5%, respectively. Lastly, under the same token budget, ThinkPRM scales up verification compute more effectively than LLM-as-a-Judge, outperforming it by 7.2% on a subset of ProcessBench. Our work highlights the value of generative, long CoT PRMs that can scale test-time compute for verification while requiring minimal supervision for training. Our code, data, and models are released at https://github.com/mukhal/thinkprm.
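The best-of-N selection mentioned in the abstract can be illustrated with a minimal sketch. This is not the ThinkPRM implementation: `score_steps` is a hypothetical stand-in for any verifier that returns one correctness score per solution step, and collapsing step scores with a minimum or a product is an assumption made here for illustration, since the abstract does not say how step-level judgments are aggregated.

```python
from math import prod
from typing import Callable, Sequence

# A candidate solution is a sequence of step strings (illustrative representation).
Solution = Sequence[str]


def solution_score(step_scores: Sequence[float], aggregate: str = "min") -> float:
    """Collapse per-step correctness scores into one solution-level score.

    The "min" and "prod" aggregations are assumptions for this sketch,
    not a method confirmed by the abstract.
    """
    if not step_scores:
        raise ValueError("a solution needs at least one scored step")
    if aggregate == "min":
        return min(step_scores)
    if aggregate == "prod":
        return prod(step_scores)
    raise ValueError(f"unknown aggregation: {aggregate!r}")


def best_of_n(
    candidates: Sequence[Solution],
    score_steps: Callable[[Solution], Sequence[float]],  # hypothetical verifier
    aggregate: str = "min",
) -> Solution:
    """Return the candidate whose aggregated step scores are highest."""
    if not candidates:
        raise ValueError("best-of-N needs at least one candidate")
    return max(candidates, key=lambda c: solution_score(score_steps(c), aggregate))


if __name__ == "__main__":
    # Toy data: hand-written step scores stand in for a real verifier's output.
    cand_a = ("expand the square", "drop a sign", "solve for x")
    cand_b = ("expand the square", "collect terms", "solve for x")
    toy_scores = {cand_a: [0.9, 0.2, 0.95], cand_b: [0.8, 0.7, 0.75]}

    chosen = best_of_n([cand_a, cand_b], lambda c: toy_scores[c])
    print(chosen)
```

With the minimum as the aggregate, a single low-scoring step is enough to sink a candidate, which is the behavior one usually wants from a step-level verifier; the product penalizes several mildly doubtful steps instead.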

Muhammad Khalifa, Rishabh Agarwal, Lajanugen Logeswaran, Jaekyeom Kim, Hao Peng, Moontae Lee, Honglak Lee, Lu Wang • 2025

Related benchmarks

Task	Dataset	Metric	Result	(unlabeled)
Mathematical Reasoning	AIME 2024	Accuracy	10.2	394
Mathematical Reasoning	AIME 2025	Accuracy	10.6	378
Interactive web-based shopping tasks	Webshop	Success Rate	43	80
Scientific Agent Task Completion	ScienceAgentBench	Success Rate (SR)	21.79	40
Uncertainty Quantification	τ2-bench Retail	AUROC	0.67	32
Uncertainty Quantification	τ2-bench Airline	AUROC	70.8	32
Data analysis step verification	DABStep	Easy Score	75	30
Multi-Turn Tool Calling	BFCL MT v4	Success Rate	40	20
Tool-augmented general task solving	AgentDojo	Success Rate	88.7	20
Conversational agents in customer-service environments	Tau2-Airline	Success Rate	64	20

Showing 10 of 19 rows
