
Answer Convergence as a Signal for Early Stopping in Reasoning

About

Chain-of-thought (CoT) prompting enhances reasoning in large language models (LLMs) but often leads to verbose and redundant outputs, increasing inference cost. We hypothesize that many reasoning steps are unnecessary for producing correct answers. To investigate this, we begin with a systematic study of the minimum reasoning required for a model to reach a stable decision. We find that on math reasoning tasks such as MATH, models typically converge to their final answers after 60% of the reasoning steps, suggesting substantial redundancy in the remaining content. Based on these insights, we propose three inference-time strategies to improve efficiency: (1) early stopping via answer consistency, (2) boosting the probability of generating end-of-reasoning signals, and (3) a supervised method that learns when to stop based on internal activations. Experiments across five benchmarks and five open-weight LLMs show that our methods significantly reduce token usage with little or no accuracy drop. In particular, on NaturalQuestions, Answer Consistency reduces tokens by over 40% while also improving accuracy. Our work underscores the importance of cost-effective reasoning methods that operate at inference time, offering practical benefits for real-world applications.
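The first strategy, early stopping via answer consistency, can be illustrated with a minimal sketch: extract an intermediate answer after each reasoning chunk and stop once the last few answers agree. The function below is a hypothetical illustration, not the paper's implementation; the `answers` iterable and `window` parameter are assumptions standing in for model-side answer extraction.

```python
def early_stop_on_consistency(answers, window=3):
    """Stop early once the last `window` intermediate answers agree.

    `answers`: iterable of answers extracted after each reasoning chunk
    (hypothetical interface; the paper works with the model's own outputs).
    Returns (final_answer, steps_consumed).
    """
    recent = []          # sliding window of most recent answers
    step = 0
    last = None
    for ans in answers:
        step += 1
        last = ans
        recent.append(ans)
        if len(recent) > window:
            recent.pop(0)
        # converged: all answers in the window are identical
        if len(recent) == window and len(set(recent)) == 1:
            return ans, step
    # never converged: fall back to the final answer seen
    return last, step

# example: the answer stabilizes at "12" by step 4 of 5,
# so the fifth reasoning chunk is never consumed
print(early_stop_on_consistency(["7", "12", "12", "12", "12"], window=3))
```

In a real deployment the generation loop would be interleaved with this check, so that decoding halts as soon as convergence is detected rather than after the full trace is produced.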

Xin Liu, Lu Wang • 2025

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Mathematical Reasoning | HMMT25 | Accuracy: 42.7 | 78 |
| Reasoning | AIME 25 | Accuracy: 73.3 | 40 |
| Mathematical Reasoning | GSM8K (test) | Top-1 Accuracy: 92.23 | 24 |
| Reasoning | AIME24 | Accuracy: 80 | 22 |
| Reasoning | AIME24, AIME25, HMMT25 Average | Accuracy: 65.1 | 20 |
