
When Individually Calibrated Models Become Collectively Miscalibrated

About

Probabilistic prediction systems often aggregate probability estimates from multiple models into a single decision. A common assumption is that if each model is individually calibrated, the aggregate prediction will also be well calibrated. We show that this assumption fails in multi-agent settings: individually calibrated predictors can become collectively miscalibrated when their predictions interact strategically, in the game-theoretic sense of Brier-optimal local response, even without deliberate coordination. This phenomenon arises naturally when agents are independently trained on overlapping data. We prove that under Brier-score-based aggregation with positively correlated beliefs, each agent's individually optimal report systematically underestimates the positive-class probability, yielding a Price of Anarchy (PoA) greater than one whenever Cov(b_i, b_j) > 0. In a canonical setting (n = 5 agents, pairwise correlation = 0.5, base rate = 0.3), the empirically measured PoA in false-negative rate reaches 7.25×. In contrast, Vickrey-Clarke-Groves (VCG) aggregation aligns incentives by rewarding each agent's marginal contribution, achieving dominant-strategy incentive compatibility and near-optimal performance. Experiments on three real-world datasets (NSL-KDD, UNSW-NB15, Credit Card Fraud) show that VCG-based aggregation provides strong robustness while maintaining comparable accuracy. It performs particularly well in data-sparse and adversarial settings, and adaptive weighting further improves performance under distribution shift.
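The premise that individual calibration does not carry over to an aggregate can be checked on a toy case. The sketch below is not the paper's strategic model or its experimental setup: it assumes two agents who each see one binary signal that matches the label with probability 0.8 (a hypothetical value), conditionally independent given the label, with the abstract's base rate of 0.3. Each agent truthfully reports its Bayesian posterior, and the aggregate is a plain average. All names (`posterior`, `calibration_table`, `ACCURACY`) are illustrative.

```python
from collections import defaultdict
from itertools import product

BASE_RATE = 0.3  # positive-class base rate, taken from the abstract's canonical setting
ACCURACY = 0.8   # assumed P(signal == label); not a value from the paper


def posterior(signal):
    """Bayesian posterior P(y = 1 | one signal): a calibrated individual report."""
    like_pos = ACCURACY if signal == 1 else 1 - ACCURACY
    like_neg = 1 - ACCURACY if signal == 1 else ACCURACY
    num = BASE_RATE * like_pos
    return num / (num + (1 - BASE_RATE) * like_neg)


def calibration_table(report_fn, n_agents=2):
    """Map each distinct report value to the exact P(y = 1 | report).

    Enumerates every (label, signals) outcome, so no sampling is involved.
    """
    mass = defaultdict(float)      # total probability of each report value
    pos_mass = defaultdict(float)  # probability of that report with y = 1
    for y in (0, 1):
        p_y = BASE_RATE if y == 1 else 1 - BASE_RATE
        for signals in product((0, 1), repeat=n_agents):
            p = p_y
            for s in signals:
                p *= ACCURACY if s == y else 1 - ACCURACY
            report = round(report_fn(signals), 10)
            mass[report] += p
            pos_mass[report] += p * y
    return {r: pos_mass[r] / mass[r] for r in mass}


# One agent alone: the realized frequency equals the report (calibrated).
single = calibration_table(lambda s: posterior(s[0]))

# Plain average of both agents' reports.
average = calibration_table(lambda s: sum(posterior(x) for x in s) / len(s))

for name, table in (("single", single), ("average", average)):
    for report, freq in sorted(table.items()):
        print(f"{name:8s} report={report:.4f}  P(y=1 | report)={freq:.4f}")
```

Under these assumptions each agent's report matches its realized frequency, but when both agents see a positive signal the average still reports about 0.63 while the positive class occurs about 87% of the time. The aggregate therefore understates the positive-class probability even with fully truthful agents; the strategic effect described in the abstract is a separate mechanism that this sketch does not model.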

Zhaohui Wang • 2026

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Intrusion Detection | UNSW-NB15 (test) | F1 Score | 94.8 | 33 |
| Fraud Detection | Credit Card Fraud Detection (test) | Recall | 83.8 | 14 |
| Intrusion Detection | NSL-KDD (test) | Recall | 99.1 | 11 |
| Classification | UCI Heart Disease | FN Rate | 20.8 | 6 |
| Classification | Pima Diabetes | False Negative Rate | 40.9 | 6 |
| Multi-class intrusion detection | CICIDS 2017 (test) | Accuracy | 99.5 | 6 |
| Multi-class classification | Intrusion Detection 12 classes, severe imbalance | Accuracy | 91.1 | 5 |