Training Verifiers to Solve Math Word Problems

About

State-of-the-art language models can match human performance on many tasks, but they still struggle to robustly perform multi-step mathematical reasoning. To diagnose the failures of current models and support research, we introduce GSM8K, a dataset of 8.5K high quality linguistically diverse grade school math word problems. We find that even the largest transformer models fail to achieve high test performance, despite the conceptual simplicity of this problem distribution. To increase performance, we propose training verifiers to judge the correctness of model completions. At test time, we generate many candidate solutions and select the one ranked highest by the verifier. We demonstrate that verification significantly improves performance on GSM8K, and we provide strong empirical evidence that verification scales more effectively with increased data than a finetuning baseline.
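The test-time procedure described above (sample many candidate solutions, then keep the one the verifier ranks highest) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `select_best`, `toy_generate`, and `toy_score` are hypothetical names, and the toy generator and verifier are stand-ins for a finetuned language model and a trained verifier.

```python
import itertools
from typing import Callable


def select_best(
    problem: str,
    generate: Callable[[str], str],
    score: Callable[[str, str], float],
    n_candidates: int,
) -> str:
    """Sample n_candidates solutions and return the highest-scoring one."""
    if n_candidates < 1:
        raise ValueError("n_candidates must be at least 1")
    candidates = [generate(problem) for _ in range(n_candidates)]
    # max() keeps the first candidate among ties.
    return max(candidates, key=lambda c: score(problem, c))


# Hypothetical stand-in for a generator model: cycles through canned outputs.
_canned = itertools.cycle(["12 + 30 = 41", "12 + 30 = 42", "12 + 30 = 43"])


def toy_generate(problem: str) -> str:
    return next(_canned)


# Hypothetical stand-in for a trained verifier: it simply re-checks the
# stated arithmetic, whereas a real verifier would output a learned score.
def toy_score(problem: str, solution: str) -> float:
    lhs, rhs = solution.split("=")
    a, b = lhs.split("+")
    return 1.0 if int(a) + int(b) == int(rhs) else 0.0


best = select_best("What is 12 + 30?", toy_generate, toy_score, n_candidates=3)
print(best)
```

Passing the generator and verifier in as plain callables keeps the selection step independent of any particular model, so either one can be swapped out without touching the ranking logic.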

Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, John Schulman • 2021

Related benchmarks

Task	Dataset	Result
Mathematical Reasoning	GSM8K	Accuracy 60	1424
Code Generation	HumanEval	--	1048
Mathematical Reasoning	MATH500 (test)	Accuracy 71.6	922
Mathematical Reasoning	GSM8K (test)	Accuracy 91.2	816
Math Reasoning	GSM8K (test)	Accuracy 57	276
Arithmetic Reasoning	GSM8K	Accuracy 55	272
Mathematical Problem Solving	MATH	Accuracy 51.2	229
Code Generation	HumanEval	Accuracy 89.3	224
Mathematical Reasoning	GSM8K	--	220
Mathematical Reasoning	AMC23	PASS@1 Accuracy 50.8	216

Showing 10 of 63 rows
