On Calibration of Large Language Models: From Response To Capability
About
Large language models (LLMs) are widely deployed as general-purpose problem solvers, making accurate confidence estimation critical for reliable use. Prior work on LLM calibration largely focuses on response-level confidence, which estimates the correctness of a single generated output. However, this formulation is misaligned with many practical settings where the central question is how likely a model is to solve a query overall. We show that this mismatch results from the stochastic nature of modern LLM decoding, under which single-response correctness fails to reflect underlying model capability. To address this issue, we introduce capability calibration, which targets the model's expected accuracy on a query. We formally distinguish capability calibration from response calibration and show that the two differ both theoretically and empirically. We establish an empirical evaluation setup and study a range of confidence estimation methods. Our results demonstrate that capability-calibrated confidence improves pass@$k$ prediction and inference budget allocation, establishing a foundation with potential for diverse applications.
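The distinction can be made concrete with a minimal sketch (an illustration, not the paper's implementation): under stochastic decoding, a query's *capability* is the model's expected single-sample accuracy, estimated from `c` correct answers out of `n` samples, while pass@$k$ follows from the standard unbiased combinatorial estimator. A capability-calibrated confidence is then scored against `c/n` rather than against the 0/1 correctness of one response.

```python
import math

def capability(n: int, c: int) -> float:
    """Empirical capability: expected single-sample accuracy on a query,
    estimated from c correct answers among n sampled responses."""
    return c / n

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: 1 - C(n-c, k) / C(n, k).
    Probability that at least one of k samples is correct."""
    if n - c < k:
        return 1.0  # too few incorrect samples to fill k draws
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

def brier(confidences, targets) -> float:
    """Brier score: mean squared error between predicted confidence
    and target. For response calibration the target is a 0/1 outcome;
    for capability calibration it is the per-query accuracy c/n."""
    return sum((p - t) ** 2 for p, t in zip(confidences, targets)) / len(targets)
```

For example, with `n = 10` samples and `c = 7` correct, the capability estimate is 0.7, and `pass_at_k(10, 7, 1)` recovers the same value, since pass@1 equals expected single-sample accuracy.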
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Calibration | MMLU | Brier Score | 0.0686 | 42 |
| Calibration | TriviaQA | Brier Score | 0.0845 | 39 |
| Confidence Calibration | SimpleQA | Brier Score | 0.0386 | 27 |
| Capability Calibration | MATH | Brier Score | 0.0267 | 18 |
| Capability Calibration | GSM8K | Brier Score | 0.0289 | 18 |
| Capability Calibration | AIME 25 | Brier Score | 0.074 | 18 |
| Capability Calibration | GPQA | Brier Score | 0.1242 | 18 |