On Calibration of Large Language Models: From Response To Capability

About

Large language models (LLMs) are widely deployed as general-purpose problem solvers, making accurate confidence estimation critical for reliable use. Prior work on LLM calibration largely focuses on response-level confidence, which estimates the correctness of a single generated output. However, this formulation is misaligned with many practical settings where the central question is how likely a model is to solve a query overall. We show that this mismatch results from the stochastic nature of modern LLM decoding, under which single-response correctness fails to reflect underlying model capability. To address this issue, we introduce capability calibration, which targets the model's expected accuracy on a query. We formally distinguish capability calibration from response calibration and show that the two differ both theoretically and empirically. We establish an empirical evaluation setup and study a range of confidence estimation methods. Our results demonstrate that capability-calibrated confidence improves pass@$k$ prediction and inference budget allocation, establishing a foundation with potential for diverse applications.
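To make the distinction concrete, here is a minimal Python sketch, not the paper's implementation: capability is approximated as the empirical pass rate over repeated stochastic decodes, and a Brier score can then be taken against either target. The function names (`estimate_capability`, `brier_score`) and the toy query are illustrative assumptions, not from the paper.

```python
import random
from typing import Callable, List

def estimate_capability(sample_once: Callable[[], bool], n: int = 64) -> float:
    """Estimate a model's capability on one query: its expected accuracy
    over stochastic decodes, approximated by the pass rate of n samples."""
    return sum(sample_once() for _ in range(n)) / n

def brier_score(confidences: List[float], targets: List[float]) -> float:
    """Mean squared error between confidence and target. Response
    calibration scores against a single response's 0/1 correctness;
    capability calibration scores against the per-query pass rate."""
    return sum((c - t) ** 2 for c, t in zip(confidences, targets)) / len(confidences)

# Toy query the model solves ~30% of the time under sampled decoding.
flaky_query = lambda: random.random() < 0.3

p_hat = estimate_capability(flaky_query, n=1000)   # ~0.3
print(brier_score([0.3], [p_hat]))                 # near 0.0: capability-calibrated
print(brier_score([0.3], [float(flaky_query())]))  # 0.09 or 0.49: single-response target
```

In this toy setting a confidence of 0.3 is essentially perfectly capability-calibrated, yet any single response scores it as though the right answer were 0 or 1, which is the response/capability gap the abstract describes.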

Sin-Han Yang, Cheng-Kuang Wu, Chieh-Yen Lin, Yun-Nung Chen, Hung-yi Lee, Shao-Hua Sun • 2026

Related benchmarks

Task                    Dataset    Result (Brier Score)   Rank
Calibration             MMLU       0.0686                 42
Calibration             TriviaQA   0.0845                 39
Confidence calibration  SimpleQA   0.0386                 27
Capability Calibration  MATH       0.0267                 18
Capability Calibration  GSM8K      0.0289                 18
Capability Calibration  AIME 25    0.074                  18
Capability Calibration  GPQA       0.1242                 18
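The two downstream uses named in the abstract can be illustrated with a hedged sketch under an independence assumption across decodes: a capability-calibrated confidence p implies a predicted pass@$k$ of 1 - (1 - p)^k, which can be inverted to allocate an inference budget per query. The function names, the 0.95 target, and the cap of 64 samples are our assumptions, not the paper's method.

```python
import math

def predicted_pass_at_k(p: float, k: int) -> float:
    """If each decode solves the query independently with probability p,
    at least one of k samples succeeds with probability 1 - (1 - p)**k."""
    return 1.0 - (1.0 - p) ** k

def samples_needed(p: float, target: float = 0.95, k_max: int = 64) -> int:
    """Smallest k whose predicted pass@k reaches `target`, capped at
    k_max for queries the model is unlikely to ever solve."""
    if p >= 1.0:
        return 1
    if p <= 0.0:
        return k_max
    k = math.ceil(math.log(1.0 - target) / math.log(1.0 - p))
    return max(1, min(k, k_max))

for conf in (0.9, 0.5, 0.1):
    print(conf, samples_needed(conf))  # 0.9 -> 2, 0.5 -> 5, 0.1 -> 29
```

Under this view, well-calibrated capability estimates let an allocator spend few decodes on queries the model reliably solves and concentrate the remaining budget on hard ones.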