MaskDiME: Adaptive Masked Diffusion for Precise and Efficient Visual Counterfactual Explanations
About
Visual counterfactual explanations aim to reveal the minimal semantic modifications that alter a model's prediction, providing causal and interpretable insights into deep neural networks. However, existing diffusion-based counterfactual generation methods are often computationally expensive, slow to sample, and imprecise in localizing the modified regions. To address these limitations, we propose MaskDiME, a simple, fast, and effective diffusion framework that unifies semantic consistency and spatial precision through localized sampling. Our approach adaptively focuses on decision-relevant regions, so that edits stay localized and semantically consistent while image fidelity is preserved. MaskDiME is training-free, runs inference over 30x faster than the baseline method, and achieves comparable or state-of-the-art performance across five benchmark datasets spanning diverse visual domains, establishing a practical and generalizable solution for efficient counterfactual explanation.
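The abstract does not spell out how localized sampling is carried out, so the following is only an illustrative sketch of one common way to restrict an edit to decision-relevant pixels. It is not the MaskDiME algorithm. It assumes a classifier gradient with respect to the image is available, keeps the pixels with the largest gradient magnitude, and copies every other pixel from the reference image. The function names `gradient_mask` and `masked_blend` and the `keep_ratio` parameter are hypothetical.

```python
import numpy as np


def gradient_mask(grad, keep_ratio=0.1):
    """Binary (H, W) mask marking the pixels with the largest gradient magnitude.

    grad: array of shape (C, H, W), e.g. a classifier gradient w.r.t. the image.
    keep_ratio: hypothetical fraction of pixels allowed to change.
    """
    magnitude = np.abs(grad).sum(axis=0)  # collapse channels -> (H, W)
    k = max(1, int(round(keep_ratio * magnitude.size)))
    # k-th largest magnitude; pixels at or above it are kept (ties may add a few).
    threshold = np.partition(magnitude.ravel(), -k)[-k]
    return (magnitude >= threshold).astype(grad.dtype)


def masked_blend(x_edit, x_ref, mask):
    """Take edited pixels inside the mask and reference pixels outside it."""
    m = mask[None]  # broadcast the (H, W) mask over the channel axis
    return m * x_edit + (1.0 - m) * x_ref


# Toy usage with synthetic arrays standing in for a gradient and two images.
grad = np.arange(48, dtype=float).reshape(3, 4, 4)
mask = gradient_mask(grad, keep_ratio=0.25)
blended = masked_blend(np.ones((3, 4, 4)), np.zeros((3, 4, 4)), mask)
```

In a diffusion sampler, a blend of this kind would typically be applied at each denoising step, with the reference image noised to the matching timestep. Whether MaskDiME does exactly this is an assumption not confirmed by the abstract.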
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Visual Counterfactual Explanation (Age) | CelebA Standard | FID 0.77 | 11 |
| Visual Counterfactual Explanation (Smile) | CelebA Standard | FID 0.71 | 11 |
| Counterfactual Explanation | ImageNet Zebra - Sorrel | FID 32.5 | 11 |
| Counterfactual Explanation | ImageNet (Cheetah - Cougar) | FID 37.4 | 11 |
| Counterfactual Explanation | ImageNet Egyptian Cat - Persian Cat | FID 36.2 | 11 |
| Counterfactual Visual Explanation | BDD100K | FID 3.19 | 10 |
| Visual Counterfactual Explanation (Age) | CelebA-HQ | FID 4.43 | 9 |
| Visual Counterfactual Explanation (Smile) | CelebA-HQ | FID 2.51 | 9 |
| Counterfactual Visual Explanation | BDD-OIA | FID 5.43 | 7 |