
Rewarding Progress: Scaling Automated Process Verifiers for LLM Reasoning

About

A promising approach for improving reasoning in large language models is to use process reward models (PRMs). PRMs provide feedback at each step of a multi-step reasoning trace, potentially improving credit assignment over outcome reward models (ORMs) that only provide feedback at the final step. However, collecting dense, per-step human labels is not scalable, and training PRMs from automatically-labeled data has thus far led to limited gains. To improve a base policy by running search against a PRM or using it as dense rewards for reinforcement learning (RL), we ask: "How should we design process rewards?". Our key insight is that, to be effective, the process reward for a step should measure progress: a change in the likelihood of producing a correct response in the future, before and after taking the step, corresponding to the notion of step-level advantages in RL. Crucially, this progress should be measured under a prover policy distinct from the base policy. We theoretically characterize the set of good provers and our results show that optimizing process rewards from such provers improves exploration during test-time search and online RL. In fact, our characterization shows that weak prover policies can substantially improve a stronger base policy, which we also observe empirically. We validate our claims by training process advantage verifiers (PAVs) to predict progress under such provers, and show that compared to ORMs, test-time search against PAVs is $>8\%$ more accurate, and $1.5-5\times$ more compute-efficient. Online RL with dense rewards from PAVs enables one of the first results with $5-6\times$ gain in sample efficiency, and $>6\%$ gain in accuracy, over ORMs.
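The key quantity described above — progress as the change in the likelihood of eventually producing a correct answer, before and after a step, measured under a prover policy — can be sketched with a Monte Carlo estimator. The snippet below is a minimal illustration, not the paper's implementation: `ToyProver` and its `complete_and_check` method are hypothetical stand-ins for an LLM prover plus an answer checker, and the trace is encoded as a list of 1/0 flags for helpful vs. wasted steps.

```python
import random

class ToyProver:
    """Hypothetical stand-in for a prover policy: its chance of finishing a
    trace correctly grows with each helpful step in the prefix (an assumption
    for illustration; a real prover is an LLM plus an answer checker)."""
    def __init__(self, seed=0):
        self.rng = random.Random(seed)

    def complete_and_check(self, prefix):
        # Toy model: success probability rises by 0.2 per helpful step.
        p = min(1.0, 0.1 + 0.2 * sum(prefix))
        return self.rng.random() < p

def rollout_success(prover, prefix, num_rollouts=256):
    """Monte Carlo estimate of Q(prefix): the probability that the prover
    reaches a correct final answer when completing the trace from `prefix`."""
    wins = sum(prover.complete_and_check(prefix) for _ in range(num_rollouts))
    return wins / num_rollouts

def progress_reward(prover, steps, t, num_rollouts=256):
    """Process reward for step t as *progress*: the change in the prover's
    success probability before vs. after the step, i.e. a Monte Carlo
    step-level advantage Q(s_{1..t+1}) - Q(s_{1..t})."""
    before = rollout_success(prover, steps[:t], num_rollouts)
    after = rollout_success(prover, steps[:t + 1], num_rollouts)
    return after - before

if __name__ == "__main__":
    prover = ToyProver()
    trace = [1, 1, 0, 1]  # 1 = a step that helps, 0 = a wasted step
    rewards = [progress_reward(prover, trace, t, num_rollouts=2000)
               for t in range(len(trace))]
    print([round(r, 2) for r in rewards])
```

Under this toy model, helpful steps receive a positive progress reward and the wasted step receives a reward near zero, which is the credit-assignment behavior the abstract argues for. A trained PAV amortizes this estimator by predicting the advantage directly instead of running rollouts.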

Amrith Setlur, Chirag Nagpal, Adam Fisch, Xinyang Geng, Jacob Eisenstein, Rishabh Agarwal, Alekh Agarwal, Jonathan Berant, Aviral Kumar • 2024

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Process-level Evaluation | ProcessBench MATH | F1 Score | 43.8 | 7 |
| Process-level Evaluation | ProcessBench GSM8K | F1 Score | 51.8 | 7 |
| Process-level Evaluation | ProcessBench Olympiad | F1 Score | 27.6 | 7 |
| Process-level Evaluation | ProcessBench Average | Mean F1 | 36.6 | 7 |
| Process-level Evaluation | ProcessBench Omni | F1 Score | 23.1 | 7 |
| Step-level quality assessment | PRMBench | Simplicity | 47.16 | 5 |
| Mathematical Reasoning | PRMBench | PRMScore | 49.6 | 4 |
| Mathematical Reasoning | MATH 500 | MATH Accuracy (BoN@8) | 47.2 | 4 |
| Computation Cost Analysis | Reasoning Samples 10K | Time Ratio | 1.17 | 4 |
| Mathematical Reasoning | ProcessBench (PB) | AUC | 0.757 | 4 |
