Making Large Language Models Better Reasoners with Step-Aware Verifier

About

Few-shot learning is a challenging task that requires language models to generalize from limited examples. Large language models such as GPT-3 and PaLM have made impressive progress in this area, but they still struggle with reasoning tasks such as GSM8K, a benchmark of arithmetic problems. To improve their reasoning skills, previous work has proposed guiding the language model with prompts that elicit a series of reasoning steps before the final answer, raising the problem-solving rate on GSM8K from 17.9% to 58.1%. In this paper, we present DIVERSE (Diverse Verifier on Reasoning Step), a novel approach that further enhances the reasoning capability of language models. DIVERSE has three main components: first, it generates diverse prompts to explore different reasoning paths for the same question; second, it uses a verifier to filter out incorrect answers based on a weighted voting scheme; and third, it verifies each reasoning step individually instead of the whole chain. We evaluate DIVERSE on the latest language model, code-davinci-002, and show that it achieves new state-of-the-art results on six of eight reasoning benchmarks (e.g., GSM8K improves from 74.4% to 83.2%).
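The weighted voting idea in the abstract can be illustrated with a minimal sketch. It assumes each sampled reasoning path has already been reduced to a final answer and a verifier score in [0, 1]. The function names, the score values, and the choice to sum scores per answer are illustrative assumptions, not the paper's implementation.

```python
from collections import defaultdict

def majority_vote(candidates):
    """Pick the answer produced by the most reasoning paths (scores ignored)."""
    counts = defaultdict(int)
    for answer, _score in candidates:
        counts[answer] += 1
    return max(counts, key=counts.get)

def weighted_vote(candidates):
    """Pick the answer whose paths have the largest total verifier score.

    Assumption: weights are summed per answer; other aggregation rules
    are possible.
    """
    totals = defaultdict(float)
    for answer, score in candidates:
        totals[answer] += score
    return max(totals, key=totals.get)

# Hypothetical (answer, verifier_score) pairs for one question.
paths = [
    ("18", 0.35),
    ("18", 0.30),
    ("18", 0.20),
    ("20", 0.95),
    ("20", 0.90),
]

print(majority_vote(paths))  # answer backed by the most paths
print(weighted_vote(paths))  # answer backed by the most verifier confidence
```

In this made-up example the two rules disagree: three low-scoring paths agree on one answer, while two high-scoring paths agree on another, so weighting by verifier score changes which answer is selected.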

Yifei Li, Zeqi Lin, Shizhuo Zhang, Qiang Fu, Bei Chen, Jian-Guang Lou, Weizhu Chen • 2022

Related benchmarks

Task	Dataset	Result
Mathematical Reasoning	GSM8K	Accuracy 82.3	1398
Mathematical Reasoning	MATH500 (test)	Accuracy 47	895
Mathematical Reasoning	GSM8K (test)	Accuracy 84.1	816
Mathematical Reasoning	GSM8K	Accuracy 91.4	499
Mathematical Reasoning	SVAMP	Accuracy 87	403
Mathematical Reasoning	AIME 2024	Accuracy 34.55	370
Arithmetic Reasoning	MultiArith	Accuracy 99.8	293
Math Reasoning	GSM8K (test)	Accuracy 92.4	250
Mathematical Reasoning	AMC	Accuracy 68.8	221
Commonsense Reasoning	StrategyQA	Accuracy 78.6	208

Showing 10 of 33 rows
