Escape Sky-high Cost: Early-stopping Self-Consistency for Multi-step Reasoning

About

Self-consistency (SC) has been a widely used decoding strategy for chain-of-thought reasoning. Despite bringing significant performance improvements across a variety of multi-step reasoning tasks, it is a high-cost method that requires drawing a preset number of samples. In this paper, we propose a simple and scalable sampling process, Early-Stopping Self-Consistency (ESC), to greatly reduce the cost of SC without sacrificing performance. On this basis, a control scheme for ESC is further derived to dynamically choose the performance-cost balance for different tasks and models. To demonstrate ESC's effectiveness, we conducted extensive experiments on three popular categories of reasoning tasks (arithmetic, commonsense, and symbolic reasoning) across language models of varying scales. The empirical results show that ESC reduces the average number of samples drawn for chain-of-thought reasoning by a significant margin on six benchmarks, including MATH (-33.8%), GSM8K (-80.1%), StrategyQA (-76.8%), CommonsenseQA (-78.5%), Coin Flip (-84.2%), and Last Letters (-67.4%), while attaining comparable performance.
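
The abstract does not spell out the stopping rule, so the following is only a minimal sketch under an assumed rule: samples are drawn in small windows, sampling stops as soon as every answer in a full window agrees, and otherwise a majority vote is taken over everything drawn once the preset budget is used up. The names `sample_answer`, `window_size`, and `max_samples` are hypothetical and chosen for illustration; `sample_answer` stands in for one chain-of-thought sample reduced to its final answer.

```python
from collections import Counter
from typing import Callable, List, Tuple


def early_stopping_self_consistency(
    sample_answer: Callable[[], str],
    window_size: int = 5,
    max_samples: int = 40,
) -> Tuple[str, int]:
    """Return (chosen_answer, number_of_samples_drawn).

    Assumed rule (not taken from the abstract): stop when a full
    window of answers is unanimous; otherwise fall back to a plain
    self-consistency majority vote over all drawn answers.
    """
    if window_size < 1 or max_samples < 1:
        raise ValueError("window_size and max_samples must be positive")

    answers: List[str] = []
    while len(answers) < max_samples:
        # The last window may be shorter if the budget is not a multiple
        # of the window size.
        n = min(window_size, max_samples - len(answers))
        window = [sample_answer() for _ in range(n)]
        answers.extend(window)
        # Early stop only on a full, unanimous window.
        if n == window_size and len(set(window)) == 1:
            return window[0], len(answers)

    # Budget exhausted: majority vote, as in standard self-consistency.
    majority, _count = Counter(answers).most_common(1)[0]
    return majority, len(answers)


if __name__ == "__main__":
    # Toy sampler that always agrees, so sampling stops after one window.
    answer, used = early_stopping_self_consistency(lambda: "42")
    print(answer, used)
```

With a sampler that always agrees, the sketch stops after a single window, which is where the saving over a fixed budget would come from; a sampler that never agrees within a window uses the full budget and reduces to ordinary self-consistency.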

Yiwei Li, Peiwen Yuan, Shaoxiong Feng, Boyuan Pan, Xinglin Wang, Bin Sun, Heda Wang, Kan Li • 2024

Related benchmarks

Task | Dataset | Result | Rank
Mathematical Reasoning | GSM8K | Accuracy 97.04 | 351
Reasoning | GPQA Diamond | Accuracy 45.69 | 88
Mathematical Reasoning | HMMT25 | Accuracy 48.9 | 78
Mathematical Reasoning | Omni-MATH | Accuracy 42.8 | 68
Visual Reasoning | V*Bench | Accuracy 87 | 58
Mathematical Reasoning | MathVision (test) | Accuracy 22.7 | 41
Reasoning | AIME 25 | Accuracy 76.7 | 40
General Knowledge Reasoning | MMLU-Pro | Accuracy 75.37 | 31
Mathematical Reasoning | MATH500 | Acc 82.6 | 30
High-resolution Visual Understanding | HR-Bench-8K | FSP 93 | 29
Showing 10 of 47 rows
