Calibration without Ground Truth

About

Villalobos et al. [2024] predict that publicly available human text will be exhausted within the next decade. Thus, improving models without access to ground-truth labels becomes increasingly important. We propose a label-free post-processing framework that improves a strong but miscalibrated model using a weaker yet better-calibrated reference. Our framework guarantees a strict performance improvement under any proper loss. Our approach is based on a characterization of when strict improvement is possible: when the strong and reference models are not mutually calibrated. We formalize this condition, connect it to arbitrage and no-trade results from economics, and develop an efficient Bregman projection algorithm that guarantees worst-case loss reduction without labels. Experiments on representative LLMs across varying scales demonstrate that our label-free method significantly reduces proper losses and calibration errors, achieving performance competitive with supervised baselines.

Yuqing Kong, Mingyu Song, Yizhou Wang, Yifan Wu • 2026
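To make the abstract's notion of a "proper loss" concrete, here is a minimal sketch using the binary Brier score, one of the metrics reported in the benchmarks. It is not the authors' algorithm. The function names `brier_score` and `expected_brier`, and the example probability 0.7, are illustrative assumptions. The sketch only shows the defining property of a proper loss: the expected loss is minimized by reporting the true probability.

```python
import numpy as np

def brier_score(probs, labels):
    """Mean squared error between predicted probabilities and binary outcomes."""
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels, dtype=float)
    return float(np.mean((probs - labels) ** 2))

def expected_brier(report, true_prob):
    """Expected Brier loss of reporting `report` when the outcome is Bernoulli(true_prob)."""
    return true_prob * (report - 1.0) ** 2 + (1.0 - true_prob) * report ** 2

# Hypothetical setting: the event truly occurs with probability 0.7.
true_prob = 0.7

# Scan candidate reports on a grid and find the one with the lowest expected loss.
reports = np.linspace(0.0, 1.0, 101)
losses = expected_brier(reports, true_prob)
best_report = float(reports[np.argmin(losses)])

# For a proper loss, the minimizer coincides with the true probability.
print(f"best report: {best_report:.2f}")
```

Because the expected loss is uniquely minimized at the true probability, any systematic gap between a model's predictions and the outcome frequencies shows up as excess Brier loss. That excess is the quantity a calibration-improving post-processing step aims to reduce.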

Related benchmarks

Task	Dataset	Result
Commonsense Reasoning	CommonsenseQA	BS 0.1262	54
Language Understanding	MMLU-Redux	Accuracy 85.75	29
Knowledge Evaluation	MMLU-Redux	Brier Score 0.1232	18
Multiple-choice Question Answering	MMLU Redux (test)	Accuracy 83.28	13
