Thought calibration: Efficient and confident test-time scaling
About
Reasoning large language models achieve impressive test-time scaling by thinking for longer, but this performance gain comes at a significant compute cost. Directly limiting the test-time budget hurts overall performance, yet not all problems are equally difficult. We propose thought calibration to decide dynamically when thinking can be terminated. To calibrate our decision rule, we view a language model's growing body of thoughts as a nested sequence of reasoning trees, where the goal is to identify the point at which novel reasoning plateaus. We realize this framework through lightweight probes that operate on top of the language model's hidden representations, which are informative of both the reasoning structure and the overall consistency of the response. Across three reasoning language models and four datasets, thought calibration preserves model performance with up to a 60% reduction in thinking tokens on in-distribution data, and up to 20% on out-of-distribution data.
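To make the probe-based stopping rule concrete, here is a minimal sketch, not the authors' implementation: it assumes a linear probe on the final-layer hidden state, a Hugging Face-style causal LM, fixed-size decoding chunks standing in for the paper's reasoning-tree boundaries, and a stopping threshold calibrated on held-out data. `StoppingProbe`, `think_with_early_exit`, and all parameter names are illustrative.

```python
import torch


class StoppingProbe(torch.nn.Module):
    """Lightweight probe over the LM's hidden state (a linear head is an
    assumption here; the paper's exact probe features may differ)."""

    def __init__(self, hidden_dim: int):
        super().__init__()
        self.linear = torch.nn.Linear(hidden_dim, 1)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # Estimated probability that novel reasoning has plateaued.
        return torch.sigmoid(self.linear(h))


@torch.no_grad()
def think_with_early_exit(model, tokenizer, prompt: str, probe: StoppingProbe,
                          threshold: float = 0.9, chunk: int = 128,
                          max_chunks: int = 32) -> str:
    """Generate the chain of thought in chunks, querying the probe after each
    one; terminate thinking once confidence crosses `threshold`.
    (Illustrative decoding loop only, not the authors' code.)"""
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    for _ in range(max_chunks):
        ids = model.generate(ids, max_new_tokens=chunk, do_sample=False)
        out = model(ids, output_hidden_states=True)
        h = out.hidden_states[-1][0, -1]  # last layer, last token
        if probe(h).item() >= threshold:
            break  # further thinking judged redundant; stop early
    return tokenizer.decode(ids[0], skip_special_tokens=True)
```

In this sketch, `threshold` is where calibration enters: it would be chosen on a held-out set so that stopping early incurs a controlled risk of degrading the final answer, trading thinking tokens against accuracy.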
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Code Generation | HumanEval | -- | -- | 1036 |
| Mathematical Reasoning | AIME 25 | Accuracy | 83.7 | 45 |
| Mathematical Reasoning | MATH 500 | Accuracy | 90.1 | 37 |
| Scientific Reasoning | GPQA | Accuracy | 54.6 | 28 |
| Early-stopping for mathematical reasoning | 5K corpus 1.0 (test) | Savings Ratio | 62.5 | 24 |
| General Reasoning | Overall (MATH-500, AIME 25, HumanEval, GPQA) | Accuracy | 70.6 | 24 |
| Reasoning step reduction | In-Distribution 5K corpus (test) | Savings Rate | 38 | 9 |
| Out-of-Distribution Generalization | AIME 26 | Saving Score | 14.7 | 6 |
| Out-of-Distribution Generalization | GPQA Diamond (OOD) | Savings | 64.3 | 6 |
| Out-of-Distribution Generalization | AIME 24 | Saving Score | 15.8 | 6 |