Let's Verify Step by Step

About

In recent years, large language models have greatly improved in their ability to perform complex multi-step reasoning. However, even state-of-the-art models still regularly produce logical mistakes. To train more reliable models, we can turn either to outcome supervision, which provides feedback for a final result, or process supervision, which provides feedback for each intermediate reasoning step. Given the importance of training reliable models, and given the high cost of human feedback, it is important to carefully compare both methods. Recent work has already begun this comparison, but many questions still remain. We conduct our own investigation, finding that process supervision significantly outperforms outcome supervision for training models to solve problems from the challenging MATH dataset. Our process-supervised model solves 78% of problems from a representative subset of the MATH test set. Additionally, we show that active learning significantly improves the efficacy of process supervision. To support related research, we also release PRM800K, the complete dataset of 800,000 step-level human feedback labels used to train our best reward model.
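To make the distinction between the two kinds of feedback concrete, here is a minimal Python sketch. It is illustrative only: the `Solution` class, the function names, and the choice to combine per-step scores by multiplying them are assumptions made for this example and are not stated in the abstract above.

```python
from dataclasses import dataclass
from math import prod
from typing import List


@dataclass
class Solution:
    """A hypothetical container for one model-written solution."""
    steps: List[str]      # intermediate reasoning steps
    final_answer: str     # the answer the solution ends with


def outcome_label(solution: Solution, reference_answer: str) -> int:
    """Outcome supervision: one label for the whole solution,
    based only on whether the final result matches the reference."""
    return int(solution.final_answer.strip() == reference_answer.strip())


def process_score(step_probs: List[float]) -> float:
    """Process supervision: one score per step.
    Assumption for this sketch: steps are combined by multiplying
    their per-step correctness probabilities."""
    return prod(step_probs)


def best_of_n(candidates_step_probs: List[List[float]]) -> int:
    """Return the index of the candidate with the highest process score."""
    return max(
        range(len(candidates_step_probs)),
        key=lambda i: process_score(candidates_step_probs[i]),
    )
```

For example, `outcome_label(Solution(["2 + 2 = 4"], "4"), "4")` gives a single label for the entire solution, while `best_of_n([[0.9, 0.2], [0.8, 0.8]])` prefers the second candidate because one weak step drags down the first candidate's combined score.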

Hunter Lightman, Vineet Kosaraju, Yura Burda, Harri Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, Karl Cobbe • 2023

Related benchmarks

Task	Dataset	Result
Mathematical Reasoning	GSM8K	Accuracy 93.3	1424
Mathematical Reasoning	MATH500 (test)	Accuracy 62.2	922
Science Question Answering	ScienceQA	Accuracy 97.5	916
Mathematical Reasoning	MATH	Accuracy 57.8	882
Mathematical Reasoning	GSM8K (test)	Accuracy 89.2	816
Mathematical Reasoning	MATH 500	Accuracy (Acc) 76.1	600
Text-to-Image Generation	GenEval	Overall Score 44	581
Multitask Language Understanding	MMLU	Accuracy 82.6	568
Mathematical Reasoning	AIME 2024	Accuracy 33.3	525
Mathematical Reasoning	GSM8K	Accuracy 86.5	499

Showing 10 of 120 rows
