
Rethinking Early Stopping: Refine, Then Calibrate

About

Machine learning classifiers often produce probabilistic predictions that are critical for accurate and interpretable decision-making in various domains. The quality of these predictions is generally evaluated with proper losses, such as cross-entropy, which decompose into two components: calibration error assesses general under/overconfidence, while refinement error measures the ability to distinguish different classes. In this paper, we present a novel variational formulation of the calibration-refinement decomposition that sheds new light on post-hoc calibration, and enables rapid estimation of the different terms. Equipped with this new perspective, we provide theoretical and empirical evidence that calibration and refinement errors are not minimized simultaneously during training. Selecting the best epoch based on validation loss thus leads to a compromise point that is suboptimal for both terms. To address this, we propose minimizing refinement error only during training (Refine,...), before minimizing calibration error post hoc, using standard techniques (...then Calibrate). Our method integrates seamlessly with any classifier and consistently improves performance across diverse classification tasks.
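The abstract's recipe, train for refinement and fix calibration afterwards, can be illustrated with temperature scaling, one of the "standard techniques" for post-hoc calibration. The sketch below is not the authors' exact procedure; the function names (`log_loss`, `fit_temperature`) and the search bounds are illustrative assumptions. Dividing logits by a single temperature T leaves each prediction's class ranking (and hence refinement) unchanged, while choosing T to minimize validation cross-entropy reduces the calibration term.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def log_loss(logits, labels, temperature=1.0):
    # Cross-entropy of the temperature-scaled softmax predictions.
    z = logits / temperature
    z = z - z.max(axis=1, keepdims=True)  # numerical stability
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

def fit_temperature(val_logits, val_labels):
    # Post-hoc calibration: pick T > 0 minimizing validation cross-entropy.
    # Scaling by T is monotone per sample, so argmax classes are unchanged.
    # Bounds (0.05, 20) are an arbitrary illustrative search range.
    res = minimize_scalar(
        lambda t: log_loss(val_logits, val_labels, t),
        bounds=(0.05, 20.0), method="bounded",
    )
    return res.x
```

For an overconfident model (logits far too peaked relative to its accuracy), the fitted temperature comes out above 1 and the scaled predictions achieve a lower validation log loss.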

Eugène Berta, David Holzmüller, Michael I. Jordan, Francis Bach · 2025

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Model Calibration | CIFAR10 (test) | -- | 61 |
| Multi-class Calibration | CIFAR-100 logits (test) | LogLoss Absolute Improvement: -0.961 | 60 |
| Multi-class Calibration | ImageNet (test) | NLL Improvement (Absolute): -0.047 | 12 |
| Multi-class Post-hoc Calibration | TabRepo 1365 multi-class experiments (test) | Brier Score Difference: -0.0024 | 9 |
| Binary Calibration | 2184 binary experiments (test) | Brier Score Gap: -0.0018 | 6 |
