Calibrated ensembles can mitigate accuracy tradeoffs under distribution shift
About
We often see undesirable tradeoffs in robust machine learning where out-of-distribution (OOD) accuracy is at odds with in-distribution (ID) accuracy: a robust classifier obtained via specialized techniques such as removing spurious features often has better OOD but worse ID accuracy than a standard classifier trained via empirical risk minimization (ERM). In this paper, we find that ID-calibrated ensembles -- where we simply ensemble the standard and robust models after calibrating on only ID data -- outperform the prior state of the art (based on self-training) on both ID and OOD accuracy. On eleven natural distribution shift datasets, ID-calibrated ensembles obtain the best of both worlds: strong ID accuracy and strong OOD accuracy. We analyze this method in stylized settings and identify two important conditions for ensembles to perform well both ID and OOD: (1) the standard and robust models must be calibrated (on ID data, because OOD data is unavailable), and (2) the OOD data must have no anticorrelated spurious features.
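The calibrate-then-ensemble recipe described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: it assumes temperature scaling as the calibration method, fits each temperature by a simple grid search on held-out ID data, and combines the two models by summing their calibrated log-probabilities. The function names (`fit_temperature`, `id_calibrated_ensemble`) and the grid range are hypothetical choices made for this example.

```python
import numpy as np


def log_softmax(logits):
    """Numerically stable log-softmax over the class axis (axis 1)."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))


def fit_temperature(logits, labels, grid=None):
    """Pick the temperature that minimizes NLL on held-out ID data.

    Assumption: temperature scaling with a grid search stands in for
    whatever calibration procedure is actually used.
    """
    if grid is None:
        grid = np.logspace(-2, 2, 401)  # candidate temperatures, 0.01 to 100
    rows = np.arange(len(labels))
    nll = [-log_softmax(logits / t)[rows, labels].mean() for t in grid]
    return float(grid[int(np.argmin(nll))])


def id_calibrated_ensemble(std_logits, rob_logits, t_std, t_rob):
    """Predict classes by summing the two models' calibrated log-probabilities.

    Assumption: summing log-probabilities is one reasonable way to combine
    the calibrated models; averaging probabilities is another.
    """
    log_p = log_softmax(std_logits / t_std) + log_softmax(rob_logits / t_rob)
    return log_p.argmax(axis=1)
```

In use, one would fit `t_std` and `t_rob` separately on ID validation logits from the standard and robust models, then call `id_calibrated_ensemble` on test logits. Both temperatures come from ID data only, which matches condition (1): no OOD data is needed at calibration time.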
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Image Classification | CIFAR-10 | Accuracy | 98.6 | 875 |
| Image Classification | ImageNet-1k (val) | Accuracy | 82.2 | 199 |
| Image Classification | ImageNet and Distribution Shifts | ImageNet-V2 Accuracy | 72.3 | 49 |
| Image Classification | STL-10 OOD | Accuracy | 97.7 | 24 |
| Image Classification | Entity-30 ID | Accuracy | 97.2 | 20 |
| Image Classification | Entity-30 OOD | Accuracy | 71.8 | 20 |