
Rewarding Progress: Scaling Automated Process Verifiers for LLM Reasoning

About

A promising approach for improving reasoning in large language models is to use process reward models (PRMs). PRMs provide feedback at each step of a multi-step reasoning trace, potentially improving credit assignment over outcome reward models (ORMs) that only provide feedback at the final step. However, collecting dense, per-step human labels is not scalable, and training PRMs from automatically-labeled data has thus far led to limited gains. To improve a base policy by running search against a PRM or using it as dense rewards for reinforcement learning (RL), we ask: "How should we design process rewards?". Our key insight is that, to be effective, the process reward for a step should measure progress: a change in the likelihood of producing a correct response in the future, before and after taking the step, corresponding to the notion of step-level advantages in RL. Crucially, this progress should be measured under a prover policy distinct from the base policy. We theoretically characterize the set of good provers and our results show that optimizing process rewards from such provers improves exploration during test-time search and online RL. In fact, our characterization shows that weak prover policies can substantially improve a stronger base policy, which we also observe empirically. We validate our claims by training process advantage verifiers (PAVs) to predict progress under such provers, and show that compared to ORMs, test-time search against PAVs is $>8\%$ more accurate, and $1.5-5\times$ more compute-efficient. Online RL with dense rewards from PAVs enables one of the first results with $5-6\times$ gain in sample efficiency, and $>6\%$ gain in accuracy, over ORMs.
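The key quantity described above — progress as the change in the likelihood of eventually producing a correct answer, before and after a step, measured under a prover policy — can be sketched with a Monte Carlo estimator. The snippet below is a minimal illustration, not the paper's implementation: `ToyProver` and its `complete_and_check` method are hypothetical stand-ins for an LLM prover plus an answer checker, and the trace is encoded as a list of 1/0 flags for helpful vs. wasted steps.

```python
import random

class ToyProver:
    """Hypothetical stand-in for a prover policy: its chance of finishing a
    trace correctly grows with each helpful step in the prefix (an assumption
    for illustration; a real prover is an LLM plus an answer checker)."""
    def __init__(self, seed=0):
        self.rng = random.Random(seed)

    def complete_and_check(self, prefix):
        # Toy model: success probability rises by 0.2 per helpful step.
        p = min(1.0, 0.1 + 0.2 * sum(prefix))
        return self.rng.random() < p

def rollout_success(prover, prefix, num_rollouts=256):
    """Monte Carlo estimate of Q(prefix): the probability that the prover
    reaches a correct final answer when completing the trace from `prefix`."""
    wins = sum(prover.complete_and_check(prefix) for _ in range(num_rollouts))
    return wins / num_rollouts

def progress_reward(prover, steps, t, num_rollouts=256):
    """Process reward for step t as *progress*: the change in the prover's
    success probability before vs. after the step, i.e. a Monte Carlo
    step-level advantage Q(s_{1..t+1}) - Q(s_{1..t})."""
    before = rollout_success(prover, steps[:t], num_rollouts)
    after = rollout_success(prover, steps[:t + 1], num_rollouts)
    return after - before

if __name__ == "__main__":
    prover = ToyProver()
    trace = [1, 1, 0, 1]  # 1 = a step that helps, 0 = a wasted step
    rewards = [progress_reward(prover, trace, t, num_rollouts=2000)
               for t in range(len(trace))]
    print([round(r, 2) for r in rewards])
```

Under this toy model, helpful steps receive a positive progress reward and the wasted step receives a reward near zero, which is the credit-assignment behavior the abstract argues for. A trained PAV amortizes this estimator by predicting the advantage directly instead of running rollouts.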

Amrith Setlur, Chirag Nagpal, Adam Fisch, Xinyang Geng, Jacob Eisenstein, Rishabh Agarwal, Alekh Agarwal, Jonathan Berant, Aviral Kumar • 2024

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Process-level Evaluation | ProcessBench MATH | F1 Score | 43.8 | 7 |
| Process-level Evaluation | ProcessBench GSM8K | F1 Score | 51.8 | 7 |
| Process-level Evaluation | ProcessBench Olympiad | F1 Score | 27.6 | 7 |
| Process-level Evaluation | ProcessBench Average | Mean F1 | 36.6 | 7 |
| Process-level Evaluation | ProcessBench Omni | F1 Score | 23.1 | 7 |
| Step-level quality assessment | PRMBench | Simplicity | 47.16 | 5 |
| Mathematical Reasoning | PRMBench | PRMScore | 49.6 | 4 |
| Mathematical Reasoning | MATH 500 | MATH Accuracy (BoN@8) | 47.2 | 4 |
| Computation Cost Analysis | Reasoning Samples 10K | Time Ratio | 1.17 | 4 |
| Mathematical Reasoning | ProcessBench (PB) | AUC | 0.757 | 4 |
