Escape Sky-high Cost: Early-stopping Self-Consistency for Multi-step Reasoning
About
Self-consistency (SC) has been a widely used decoding strategy for chain-of-thought reasoning. Despite bringing significant performance improvements across a variety of multi-step reasoning tasks, it is a high-cost method that requires drawing a preset number of samples for every question. In this paper, we propose a simple and scalable sampling process, Early-Stopping Self-Consistency (ESC), to greatly reduce the cost of SC without sacrificing performance. On this basis, a control scheme for ESC is further derived to dynamically choose the performance-cost balance for different tasks and models. To demonstrate ESC's effectiveness, we conducted extensive experiments on three popular categories of reasoning tasks (arithmetic, commonsense, and symbolic reasoning) over language models of varying scales. The empirical results show that ESC reduces the average number of chain-of-thought samples by a significant margin on six benchmarks, including MATH (-33.8%), GSM8K (-80.1%), StrategyQA (-76.8%), CommonsenseQA (-78.5%), Coin Flip (-84.2%) and Last Letters (-67.4%), while attaining comparable performance.
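The abstract does not spell out the stopping rule, so the following is only a minimal sketch of how an early-stopping variant of SC could look. It assumes a window-based criterion: samples are drawn in small windows, and sampling stops as soon as every answer in a window agrees. The names `early_stopping_self_consistency`, `sample_answer`, `max_samples`, and `window_size` are hypothetical, and `sample_answer` stands in for one chain-of-thought sample from a language model reduced to its final answer.

```python
from collections import Counter
from typing import Callable, List, Tuple


def early_stopping_self_consistency(
    sample_answer: Callable[[], str],
    max_samples: int = 40,
    window_size: int = 5,
) -> Tuple[str, int]:
    """Return (answer, number_of_samples_used).

    Assumption (not stated in the abstract): sampling stops once a full
    window of answers is unanimous. Otherwise it falls back to a plain
    majority vote over all samples, as in standard self-consistency.
    """
    answers: List[str] = []
    while len(answers) < max_samples:
        # Draw the next window, truncated if the budget is nearly spent.
        n = min(window_size, max_samples - len(answers))
        window = [sample_answer() for _ in range(n)]
        answers.extend(window)
        # Stop early only on a full, unanimous window.
        if n == window_size and len(set(window)) == 1:
            return window[0], len(answers)
    # Budget exhausted: majority vote over everything sampled.
    majority = Counter(answers).most_common(1)[0][0]
    return majority, len(answers)
```

With a sampler that always returns the same answer, this sketch stops after the first window instead of spending the full budget, which is the kind of saving the reported sample reductions reflect. A larger `window_size` trades more samples for a stricter agreement check.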
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Mathematical Reasoning | GSM8K | Accuracy | 97.04 | 499 |
| Mathematical Reasoning | MathQA | Accuracy | 84.7 | 305 |
| Reasoning | GPQA Diamond | Accuracy | 45.69 | 135 |
| Multimodal Reasoning | LogicVista | Accuracy | 54.6 | 99 |
| Mathematical Reasoning | HMMT25 | Accuracy | 48.9 | 95 |
| Mathematical Reasoning | Omni-MATH | Accuracy | 42.8 | 93 |
| High-resolution Visual Understanding | HR-Bench-8K | FSP | 93 | 73 |
| Visual Reasoning | V*Bench | Accuracy | 87 | 58 |
| Mathematical Reasoning | MathVision (test) | Accuracy | 22.7 | 53 |
| Reasoning | AIME 25 | Accuracy | 76.7 | 40 |