When Individually Calibrated Models Become Collectively Miscalibrated

About

Probabilistic prediction systems often aggregate probability estimates from multiple models into a single decision. A common assumption is that if each model is individually calibrated, the aggregate prediction will also be well calibrated. We show that this assumption fails in multi-agent settings: individually calibrated predictors can become collectively miscalibrated when their predictions interact strategically, in the game-theoretic sense of Brier-optimal local response, even without deliberate coordination. This phenomenon arises naturally when agents are independently trained on overlapping data. We prove that under Brier-score-based aggregation with positively correlated beliefs, each agent's individually optimal report systematically underestimates the positive-class probability, yielding a Price of Anarchy (PoA) greater than one whenever Cov(b_i, b_j) > 0. In a canonical setting (n = 5 agents, pairwise correlation = 0.5, base rate = 0.3), the empirically measured PoA in false-negative rate reaches 7.25x. In contrast, Vickrey-Clarke-Groves (VCG) aggregation aligns incentives by rewarding marginal contribution, achieving dominant-strategy incentive compatibility and near-optimal performance. Experiments on three real-world datasets (NSL-KDD, UNSW-NB15, Credit Card Fraud) show that VCG provides strong robustness while maintaining comparable accuracy. It performs particularly well in data-sparse and adversarial settings, and adaptive weighting further improves performance under distribution shift.
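
The general point that calibration need not survive aggregation can be checked by hand on a toy case. The sketch below is not the paper's strategic Brier-response mechanism or its experimental setup. It is a hypothetical two-agent example with independent binary signals and plain averaging, and all names in it (calibration_table, agent1, agent2, average) are illustrative assumptions. It computes, by exact enumeration, the true positive-class probability conditional on each reported value.

```python
from collections import defaultdict
from fractions import Fraction
from itertools import product


def calibration_table(report):
    """Map each distinct reported probability to the true P(Y=1 | report).

    Toy world (an assumption for illustration): two independent fair
    binary signals s1, s2, and P(Y=1 | s1, s2) = (s1 + s2) / 2.
    """
    mass = defaultdict(Fraction)       # probability of each report value
    positives = defaultdict(Fraction)  # probability of (report value, Y=1)
    for s1, s2 in product((0, 1), repeat=2):
        p_state = Fraction(1, 4)       # each signal pair is equally likely
        p_y = Fraction(s1 + s2, 2)     # toy ground-truth positive probability
        r = report(s1, s2)
        mass[r] += p_state
        positives[r] += p_state * p_y
    return {r: positives[r] / mass[r] for r in mass}


def agent1(s1, s2):
    # Agent 1 sees only s1 and reports P(Y=1 | s1) = (2*s1 + 1) / 4.
    return Fraction(2 * s1 + 1, 4)


def agent2(s1, s2):
    # Agent 2 sees only s2 and reports P(Y=1 | s2) = (2*s2 + 1) / 4.
    return Fraction(2 * s2 + 1, 4)


def average(s1, s2):
    # Naive aggregate: unweighted mean of the two individual reports.
    return (agent1(s1, s2) + agent2(s1, s2)) / 2


for name, rule in [("agent1", agent1), ("agent2", agent2), ("average", average)]:
    for reported, actual in sorted(calibration_table(rule).items()):
        print(f"{name}: reported {reported}, actual {actual}")
```

Each agent is exactly calibrated here: a report of 3/4 is followed by a positive outcome with probability 3/4, and likewise for 1/4. The averaged report is not. When both signals are 1 the average still says 3/4 although the true probability is 1, so the aggregate understates the positive class in exactly the cases where the agents agree.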

Zhaohui Wang • 2026

Related benchmarks

Task	Dataset	Result
Intrusion Detection	NSL-KDD (test)	Recall 99.1	35
Intrusion Detection	UNSW-NB15 (test)	F1 Score 94.8	33
Fraud Detection	Credit Card Fraud Detection (test)	Recall 83.8	26
Classification	UCI Heart Disease	FN Rate 20.8	6
Classification	Pima Diabetes	False Negative Rate 40.9	6
Multi-class intrusion detection	CICIDS 2017 (test)	Accuracy 99.5	6
Multi-class classification	Intrusion Detection 12 classes, severe imbalance	Accuracy 91.1	5