On Calibration of Modern Neural Networks
About
Confidence calibration -- the problem of predicting probability estimates representative of the true correctness likelihood -- is important for classification models in many applications. We discover that modern neural networks, unlike those from a decade ago, are poorly calibrated. Through extensive experiments, we observe that depth, width, weight decay, and Batch Normalization are important factors influencing calibration. We evaluate the performance of various post-processing calibration methods on state-of-the-art architectures with image and document classification datasets. Our analysis and experiments not only offer insights into neural network learning, but also provide a simple and straightforward recipe for practical settings: on most datasets, temperature scaling -- a single-parameter variant of Platt Scaling -- is surprisingly effective at calibrating predictions.
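As a rough illustration of the recipe mentioned above, the sketch below divides a classifier's logits by a single scalar temperature T before the softmax and picks T by minimizing negative log-likelihood on held-out validation data. The function names (`log_softmax`, `nll`, `fit_temperature`), the toy logits, and the golden-section search are assumptions made for this example; any one-dimensional optimizer could be swapped in, and this is not the authors' implementation.

```python
import math

def log_softmax(logits, temperature=1.0):
    """Log-probabilities of softmax(logits / temperature), computed stably."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    log_norm = m + math.log(sum(math.exp(s - m) for s in scaled))
    return [s - log_norm for s in scaled]

def nll(logit_rows, labels, temperature):
    """Mean negative log-likelihood of the true labels at a given temperature."""
    total = 0.0
    for logits, y in zip(logit_rows, labels):
        total -= log_softmax(logits, temperature)[y]
    return total / len(labels)

def fit_temperature(logit_rows, labels, lo=0.05, hi=20.0, iters=60):
    """Golden-section search over log(T) for the temperature minimizing NLL.

    The search bounds and iteration count are illustrative assumptions.
    """
    a, b = math.log(lo), math.log(hi)
    phi = (math.sqrt(5) - 1) / 2
    for _ in range(iters):
        c = b - phi * (b - a)
        d = a + phi * (b - a)
        if nll(logit_rows, labels, math.exp(c)) < nll(logit_rows, labels, math.exp(d)):
            b = d  # minimum lies in [a, d]
        else:
            a = c  # minimum lies in [c, b]
    return math.exp((a + b) / 2)

# Hypothetical validation set: very confident logits, but only 3 of 4 correct.
val_logits = [
    [8.0, 0.0, 0.0],
    [0.0, 8.0, 0.0],
    [0.0, 0.0, 8.0],
    [8.0, 0.0, 0.0],  # confidently wrong: the true label is 1
]
val_labels = [0, 1, 2, 1]

T = fit_temperature(val_logits, val_labels)

# Confidence of the top class before and after scaling, for the first example.
conf_before = math.exp(max(log_softmax(val_logits[0], 1.0)))
conf_after = math.exp(max(log_softmax(val_logits[0], T)))
print(f"T = {T:.3f}, confidence {conf_before:.4f} -> {conf_after:.4f}")
```

Because every logit is divided by the same positive constant, the predicted class is unchanged; only the confidence moves. On this toy set the fitted temperature is greater than 1, which lowers the top-class confidence toward the observed 75% accuracy.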
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Image Classification | CIFAR-100 | Top-1 Accuracy: 76.74 | 622 |
| Image Classification | Food-101 | Accuracy: 86.6 | 494 |
| Image Classification | ImageNet LT | Top-1 Accuracy: 37.9 | 251 |
| Long-Tailed Image Classification | ImageNet-LT (test) | -- | 220 |
| Out-of-Distribution Detection | iNaturalist | FPR@95: 37.63 | 200 |
| Image Classification | ImageNet-LT (test) | -- | 159 |
| Node Classification | Computers | -- | 143 |
| Out-of-Distribution Detection | Textures | AUROC: 0.8539 | 141 |
| Commonsense Reasoning | ARC Challenge | Accuracy: 64.9 | 132 |
| Out-of-Distribution Detection | OpenImage-O | AUROC: 87.22 | 107 |