Counterfactual Stress Testing for Image Classification Models

About

Deep learning models in medical imaging often fail when deployed in new clinical environments due to distribution shifts in demographics, scanner hardware, or acquisition protocols. A central challenge is underspecification, where models with similar validation performance exhibit divergent real-world failure modes. Although stress testing has emerged as a tool to assess this, current methods typically rely on simple, uninformed perturbations (e.g., brightness or contrast changes), which fail to capture clinically realistic variation and can overestimate robustness. In this work, we introduce a counterfactual stress testing framework based on causal generative models that create realistic "what if" images by intervening on attributes such as scanner type and patient sex while preserving anatomical identity, enabling controlled and semantically meaningful evaluation under targeted distribution shifts. Across two imaging modalities (chest X-ray and mammography), three model architectures, and multiple shift scenarios, we show that counterfactual stress tests provide a substantially more accurate proxy for real out-of-distribution performance than classical perturbations, capturing the direction and relative magnitude of performance changes as well as model ranking. These results suggest that causal generative models can serve as practical simulators for robustness assessment, offering a more reliable basis for evaluating medical AI systems prior to deployment.

Moritz Stammel, Fabio De Sousa Ribeiro, Raghav Mehta, Mélanie Roschewitz, Ben Glocker • 2026

Related benchmarks

Task                          Dataset                             Result     Rank
Performance Shift Prediction  PadChest Single Sex Male            MAE 0.013  6
Performance Shift Prediction  PadChest Single Sex Female          MAE 0.011  6
Performance Shift Prediction  PadChest Single Scanner: Philips    MAE 0.017  6
Performance Shift Prediction  PadChest Composite Philips Male     MAE 0.003  6
Performance Shift Prediction  PadChest Composite Philips: Female  MAE 0.008  6
Performance Shift Prediction  PadChest Composite IDC: Male        MAE 0.009  6
Performance Shift Prediction  PadChest Composite IDC: Female      MAE 3.2    6
Performance Shift Prediction  EMBED Scanner 2000D                 MAE 0.031  6
Performance Shift Prediction  EMBED Scanner: Clearview            MAE 0.003  6
Performance Shift Prediction  PadChest Single Scanner: IDC        MAE 0.032  6
Showing 10 of 11 rows
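
The table reports a mean absolute error (MAE) for performance shift prediction, and the abstract says a good stress test should capture the direction of the performance change, its magnitude, and the model ranking. The sketch below shows one way such a comparison could be computed. It assumes that the MAE is taken between the performance change predicted by the stress test and the change actually observed on out-of-distribution data; the listing does not define the metric, so this is an illustration, not the authors' evaluation code. The function name, the model names, and all numbers are hypothetical.

```python
from statistics import mean


def shift_prediction_report(id_perf, proxy_perf, ood_perf):
    """Compare stress-test (proxy) performance with real OOD performance.

    Each argument maps a model name to a scalar metric such as AUROC:
      id_perf    - performance on in-distribution test data
      proxy_perf - performance on stress-test images (e.g. counterfactuals)
      ood_perf   - performance on real out-of-distribution data
    """
    models = sorted(id_perf)

    # Performance change predicted by the stress test vs. the observed change.
    predicted = {m: proxy_perf[m] - id_perf[m] for m in models}
    actual = {m: ood_perf[m] - id_perf[m] for m in models}

    # Magnitude: mean absolute error between predicted and actual change.
    mae = mean(abs(predicted[m] - actual[m]) for m in models)

    # Direction: does the proxy predict a drop (or gain) where one occurs?
    same_direction = all(
        (predicted[m] >= 0) == (actual[m] >= 0) for m in models
    )

    # Ranking: does the proxy order the models the same way as real OOD data?
    rank_proxy = sorted(models, key=lambda m: proxy_perf[m], reverse=True)
    rank_ood = sorted(models, key=lambda m: ood_perf[m], reverse=True)

    return {
        "mae": mae,
        "same_direction": same_direction,
        "same_ranking": rank_proxy == rank_ood,
    }


# Made-up numbers for three hypothetical models; not results from the paper.
id_perf = {"model_a": 0.90, "model_b": 0.88, "model_c": 0.86}
proxy_perf = {"model_a": 0.85, "model_b": 0.86, "model_c": 0.80}
ood_perf = {"model_a": 0.84, "model_b": 0.85, "model_c": 0.82}

report = shift_prediction_report(id_perf, proxy_perf, ood_perf)
print(report)
```

A lower MAE means the stress test tracks the real performance change more closely; the two boolean fields check the direction and ranking criteria separately, since a proxy can have a small error and still order the models incorrectly.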
