When Individually Calibrated Models Become Collectively Miscalibrated
About
Probabilistic prediction systems often aggregate probability estimates from multiple models into a single decision. A common assumption is that if each model is individually calibrated, the aggregate prediction will also be well calibrated. We show that this assumption fails in multi-agent settings: individually calibrated predictors can become collectively miscalibrated when their predictions interact strategically, in the game-theoretic sense of Brier-optimal local response, even without deliberate coordination. This phenomenon arises naturally when agents are independently trained on overlapping data. We prove that under Brier-score-based aggregation with positively correlated beliefs, each agent's individually optimal report systematically underestimates the positive-class probability, yielding a Price of Anarchy (PoA) greater than one whenever Cov(b_i, b_j) > 0, where b_i denotes agent i's belief. In a canonical setting (n = 5 agents, pairwise correlation = 0.5, base rate = 0.3), the empirically measured PoA in false-negative rate reaches 7.25x. In contrast, aggregation based on the Vickrey-Clarke-Groves (VCG) mechanism aligns incentives by rewarding each agent's marginal contribution, achieving dominant-strategy incentive compatibility and near-optimal performance. Experiments on three real-world datasets (NSL-KDD, UNSW-NB15, Credit Card Fraud) show that VCG provides strong robustness while maintaining comparable accuracy. It performs particularly well in data-sparse and adversarial settings, and adaptive weighting further improves performance under distribution shift.
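As a rough illustration of why individual calibration need not carry over to an aggregate, the sketch below enumerates a toy model exactly. It is not the paper's strategic Brier-response mechanism: it only shows the simpler effect that plainly averaging individually calibrated, positively correlated reports can be miscalibrated. The binary-signal model and the `HIT` and `FALSE_ALARM` rates are assumptions made for illustration; only the base rate of 0.3 and n = 5 agents are taken from the abstract.

```python
from itertools import product

BASE_RATE = 0.3     # P(y = 1), the base rate quoted in the abstract
N_AGENTS = 5        # number of agents quoted in the abstract
HIT = 0.8           # assumed P(signal = 1 | y = 1), illustrative only
FALSE_ALARM = 0.2   # assumed P(signal = 1 | y = 0), illustrative only


def signal_likelihood(signals, y):
    """P(signals | y) for signals that are conditionally independent given y."""
    p_one = HIT if y == 1 else FALSE_ALARM
    prob = 1.0
    for s in signals:
        prob *= p_one if s == 1 else 1.0 - p_one
    return prob


def posterior(signals):
    """Bayes posterior P(y = 1 | signals)."""
    pos = BASE_RATE * signal_likelihood(signals, 1)
    neg = (1.0 - BASE_RATE) * signal_likelihood(signals, 0)
    return pos / (pos + neg)


def individual_calibration_gap():
    """Largest gap between one agent's report and the true frequency of y = 1
    among all cases in which that agent makes that report."""
    worst = 0.0
    for s in (0, 1):
        joint = {0: 0.0, 1: 0.0}
        for profile in product((0, 1), repeat=N_AGENTS):
            if profile[0] != s:
                continue
            for y in (0, 1):
                prior = BASE_RATE if y == 1 else 1.0 - BASE_RATE
                joint[y] += prior * signal_likelihood(profile, y)
        frequency = joint[1] / (joint[0] + joint[1])
        worst = max(worst, abs(frequency - posterior((s,))))
    return worst


def aggregate_table():
    """For each count k of positive signals, return
    (k, mean of the individual reports, joint posterior given all signals)."""
    belief = {s: posterior((s,)) for s in (0, 1)}
    rows = []
    for k in range(N_AGENTS + 1):
        signals = (1,) * k + (0,) * (N_AGENTS - k)
        mean_report = sum(belief[s] for s in signals) / N_AGENTS
        rows.append((k, mean_report, posterior(signals)))
    return rows


if __name__ == "__main__":
    print(f"individual calibration gap: {individual_calibration_gap():.2e}")
    for k, mean_report, truth in aggregate_table():
        print(f"k={k}  mean report={mean_report:.4f}  P(y=1 | signals)={truth:.4f}")
```

In this toy model each agent's report equals the true frequency of the positive class among the cases where it makes that report, so every agent is calibrated. The mean report is nonetheless pulled toward the middle: it understates the positive-class probability when most agents see a positive signal and overstates it when none do. The strategic under-reporting analysed in the paper is a separate effect that this sketch does not model.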
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Intrusion Detection | UNSW-NB15 (test) | F1 Score | 94.8 | 33 |
| Fraud Detection | Credit Card Fraud Detection (test) | Recall | 83.8 | 14 |
| Intrusion Detection | NSL-KDD (test) | Recall | 99.1 | 11 |
| Classification | UCI Heart Disease | False Negative Rate | 20.8 | 6 |
| Classification | Pima Diabetes | False Negative Rate | 40.9 | 6 |
| Multi-class intrusion detection | CICIDS 2017 (test) | Accuracy | 99.5 | 6 |
| Multi-class classification | Intrusion Detection 12 classes, severe imbalance | Accuracy | 91.1 | 5 |