Clustered Calibration: Representation-Aware Probability Calibration via Learned Subpopulations
About
Ensuring that predicted probabilities align with observed frequencies is critical in high-stakes domains such as clinical decision support, autonomous driving and financial risk assessment. Existing calibration methods typically apply a single global transformation or rely on post-hoc binning over predicted confidences, limiting their ability to exploit heterogeneous reliability across sub-populations. We propose Clustered Calibration, a representation-aware framework that identifies sub-populations via clustering in learned feature spaces (e.g., coverage vectors, SHAP values, CNN activations, Transformer embeddings) and fits a soft mixture of cluster-specific parametric calibrators under hierarchical shrinkage toward a global mapping. This design yields context-specific calibration while maintaining global stability. Across six tabular datasets and additional image and text benchmarks, clustered calibration consistently improves or matches strong global calibrators in terms of negative log-likelihood and Brier score, while preserving AUC and accuracy. We further show, both analytically and empirically, that fixed-bin Expected Calibration Error (ECE) can mis-rank soft, region-aware calibrators even when proper scoring rules improve, and we advocate for log-loss and Brier as more reliable bases for model selection in such settings.
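The core idea described above (cluster in a feature space, fit one calibrator per cluster, shrink each toward a global mapping, and score with proper scoring rules) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. It assumes binary labels, Platt-style scaling as the parametric calibrator, a plain k-means with softmax-over-distance responsibilities as the soft assignment, and an L2 penalty as a stand-in for hierarchical shrinkage. All function names, the `lam` and `temp` parameters, and the synthetic data in `demo` are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def nll(p, y, eps=1e-12):
    # Mean negative log-likelihood (log-loss) for binary labels.
    p = np.clip(p, eps, 1 - eps)
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))

def brier(p, y):
    return float(np.mean((p - y) ** 2))

def kmeans(x, k, iters=25, seed=0):
    # Plain Lloyd iterations; stands in for any clustering of learned features.
    rng = np.random.default_rng(seed)
    c = x[rng.choice(len(x), size=k, replace=False)].copy()
    for _ in range(iters):
        lab = ((x[:, None, :] - c[None, :, :]) ** 2).sum(-1).argmin(1)
        for j in range(k):
            if np.any(lab == j):
                c[j] = x[lab == j].mean(0)
    return c

def responsibilities(x, c, temp=1.0):
    # Soft cluster membership: softmax over negative squared distances.
    s = -((x[:, None, :] - c[None, :, :]) ** 2).sum(-1) / temp
    s -= s.max(axis=1, keepdims=True)
    w = np.exp(s)
    return w / w.sum(axis=1, keepdims=True)

def fit_platt(z, y, w=None, prior=(1.0, 0.0), lam=0.0, steps=500, lr=0.5):
    # Weighted Platt scaling p = sigmoid(a*z + b), fitted by gradient descent,
    # with an L2 penalty pulling (a, b) toward `prior` (assumed shrinkage form).
    w = np.ones_like(z) if w is None else w
    a, b = prior
    for _ in range(steps):
        err = w * (sigmoid(a * z + b) - y)
        ga = (err * z).sum() / (w.sum() + 1e-12) + lam * (a - prior[0])
        gb = err.sum() / (w.sum() + 1e-12) + lam * (b - prior[1])
        a, b = a - lr * ga, b - lr * gb
    return a, b

def fit_clustered(z, y, x, k=2, lam=0.05):
    g = fit_platt(z, y)                      # global calibrator
    c = kmeans(x, k)                         # sub-populations in feature space
    r = responsibilities(x, c)
    params = [fit_platt(z, y, w=r[:, j], prior=g, lam=lam) for j in range(k)]
    return g, c, params

def predict_clustered(z, x, c, params):
    # Soft mixture: responsibility-weighted average of per-cluster outputs.
    r = responsibilities(x, c)
    p = np.stack([sigmoid(a * z + b) for a, b in params], axis=1)
    return (r * p).sum(axis=1)

def demo(seed=0, n=4000):
    # Synthetic data: two feature-space groups whose logits are
    # miscalibrated in different ways (purely illustrative).
    rng = np.random.default_rng(seed)
    grp = rng.integers(0, 2, size=n)
    x = rng.normal(size=(n, 2)) + np.where(grp[:, None] == 0, -3.0, 3.0)
    z = rng.normal(scale=2.0, size=n)
    true_p = np.where(grp == 0, sigmoid(0.5 * z), sigmoid(2.0 * z + 1.0))
    y = (rng.random(n) < true_p).astype(float)
    tr, te = slice(0, n // 2), slice(n // 2, n)
    g, c, params = fit_clustered(z[tr], y[tr], x[tr])
    p_glob = sigmoid(g[0] * z[te] + g[1])
    p_clus = predict_clustered(z[te], x[te], c, params)
    return {"nll_global": nll(p_glob, y[te]), "nll_clustered": nll(p_clus, y[te]),
            "brier_global": brier(p_glob, y[te]), "brier_clustered": brier(p_clus, y[te])}
```

Calling `demo()` returns held-out NLL and Brier scores for the global and clustered calibrators on the synthetic data. Because each calibrator is a monotone map of the logit only within its own cluster, ranking metrics such as AUC can change slightly across clusters in a sketch like this; how the paper preserves AUC and accuracy is not specified in the abstract.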
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Image Classification Calibration | CIFAR100 | Classwise ECE | 0.0386 | 99 |
| Calibration | Tabular datasets | NLL | 0.2983 | 21 |
| Image Classification Calibration | ImageNet | Accuracy | 79.19 | 15 |
| Text Classification | IMDB binary sentiment (five random splits) | NLL | 0.324 | 11 |
| Image Classification Calibration | BloodMNIST | NLL | 0.3121 | 9 |
| Text Classification | Emotion multi-class (five random splits) | NLL | 0.157 | 9 |
| Classification Calibration | Adult | Delta NLL (%) | 12 | 1 |
| Classification Calibration | Credit | Delta NLL (%) | 1.55 | 1 |
| Classification Calibration | Diabetes130 | Delta NLL (%) | 17 | 1 |
| Classification Calibration | LOS | Delta NLL | 0.16 | 1 |