Escape Sky-high Cost: Early-stopping Self-Consistency for Multi-step Reasoning

About

Self-consistency (SC) has been a widely used decoding strategy for chain-of-thought reasoning. Despite bringing significant performance improvements across a variety of multi-step reasoning tasks, it is a high-cost method that requires drawing a preset number of samples for every question. In this paper, we propose a simple and scalable sampling process, Early-Stopping Self-Consistency (ESC), to greatly reduce the cost of SC without sacrificing performance. On this basis, we further derive a control scheme for ESC that dynamically chooses the performance-cost balance for different tasks and models. To demonstrate ESC's effectiveness, we conducted extensive experiments on three popular categories of reasoning tasks (arithmetic, commonsense and symbolic reasoning) over language models of varying scales. The empirical results show that ESC reduces the average number of samples needed for chain-of-thought reasoning by a significant margin on six benchmarks, including MATH (-33.8%), GSM8K (-80.1%), StrategyQA (-76.8%), CommonsenseQA (-78.5%), Coin Flip (-84.2%) and Last Letters (-67.4%), while attaining comparable performance.

Yiwei Li, Peiwen Yuan, Shaoxiong Feng, Boyuan Pan, Xinglin Wang, Bin Sun, Heda Wang, Kan Li • 2024
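
The abstract does not spell out the stopping rule, so the following is only a minimal sketch of how early stopping on top of self-consistency might look, not the authors' implementation. It assumes samples are drawn in fixed-size windows and that sampling stops as soon as every answer in a full window agrees; otherwise it falls back to a plain majority vote once the budget is exhausted. The names `sample_answer`, `window_size` and `max_samples`, and their default values, are hypothetical.

```python
from collections import Counter
from typing import Callable, List, Tuple


def early_stopping_self_consistency(
    sample_answer: Callable[[], str],  # hypothetical: one chain-of-thought sample -> final answer
    window_size: int = 5,              # assumed window length
    max_samples: int = 40,             # assumed preset SC budget
) -> Tuple[str, int]:
    """Return (chosen answer, number of samples actually drawn)."""
    answers: List[str] = []
    while len(answers) < max_samples:
        # Draw one window without exceeding the overall budget.
        n = min(window_size, max_samples - len(answers))
        window = [sample_answer() for _ in range(n)]
        answers.extend(window)
        # Assumed stopping rule: a full window whose answers all agree.
        if n == window_size and len(set(window)) == 1:
            return window[0], len(answers)
    # Budget exhausted: fall back to the usual SC majority vote.
    return Counter(answers).most_common(1)[0][0], len(answers)
```

Under these assumed defaults, a question on which the model answers consistently would stop after 5 samples instead of 40, which is the kind of saving the reported reductions describe; questions with scattered answers still consume the full budget.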

Related benchmarks

Task	Dataset	Result
Mathematical Reasoning	GSM8K	Accuracy 97.04	499
Mathematical Reasoning	MathQA	Accuracy 84.7	354
Reasoning	GPQA Diamond	Accuracy 45.69	185
Multimodal Reasoning	LogicVista	Accuracy 54.6	172
Mathematical Reasoning	Omni-MATH	Accuracy 42.8	135
Mathematical Reasoning	HMMT25	Accuracy 48.9	119
High-resolution Visual Understanding	HR-Bench-8K	FSP 93	83
Financial Reasoning	FinQA	Accuracy 70.4	69
Visual Reasoning	V*Bench	Accuracy 87	62
Mathematical Reasoning	MathVision (test)	Accuracy 22.7	53

Showing 10 of 79 rows
