Calibration without Ground Truth
About
Villalobos et al. [2024] predict that publicly available human text will be exhausted within the next decade. Thus, improving models without access to ground-truth labels becomes increasingly important. We propose a label-free post-processing framework that improves a strong but miscalibrated model using a weaker yet better-calibrated reference. Our framework guarantees a strict performance improvement under any proper loss. Our approach is based on a characterization of when strict improvement is possible: when the strong and reference models are not mutually calibrated. We formalize this condition, connect it to arbitrage and no-trade results from economics, and develop an efficient Bregman projection algorithm that guarantees worst-case loss reduction without labels. Experiments on representative LLMs across varying scales demonstrate that our label-free method significantly reduces proper losses and calibration errors, achieving performance competitive with supervised baselines.
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Commonsense Reasoning | CommonsenseQA | BS | 0.1262 | 54 |
| Language Understanding | MMLU-Redux | Base Score | 0.3571 | 24 |
| Knowledge Evaluation | MMLU-Redux | Brier Score | 0.1232 | 18 |
| Multiple-choice Question Answering | MMLU Redux (test) | BS | 0.1232 | 12 |
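
Several of the results above are Brier scores, one example of the proper losses the abstract refers to. As a rough illustration only, the sketch below computes a multiclass Brier score with NumPy. The function name `brier_score` and the sum-over-classes normalization are assumptions made for this example; the listing does not say which convention the reported numbers use, and some implementations additionally divide by the number of classes.

```python
import numpy as np


def brier_score(probs, labels):
    """Mean squared distance between predicted distributions and one-hot labels.

    Hypothetical helper for illustration; assumes the sum-over-classes
    convention, which may differ from the one behind the reported results.
    """
    probs = np.asarray(probs, dtype=float)    # shape (n_examples, n_classes)
    labels = np.asarray(labels, dtype=int)    # shape (n_examples,)
    onehot = np.eye(probs.shape[1])[labels]   # one-hot encode the true classes
    per_example = np.sum((probs - onehot) ** 2, axis=1)
    return float(np.mean(per_example))


if __name__ == "__main__":
    # Two four-option questions: one confident prediction, one uniform guess.
    probs = [[0.7, 0.1, 0.1, 0.1],
             [0.25, 0.25, 0.25, 0.25]]
    labels = [0, 2]
    print(brier_score(probs, labels))
```

Lower values are better. Under this convention a perfect prediction scores 0, and a uniform guess over four options scores 0.75.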