Winsor-CAM: Human-Tunable Visual Explanations from Deep Networks via Layer-Wise Winsorization
About
Interpreting Convolutional Neural Networks (CNNs) is critical for safety-sensitive applications such as healthcare and autonomous systems. Popular visual explanation methods like Grad-CAM rely on a single convolutional layer, potentially missing multi-scale cues and producing unstable saliency maps. We introduce Winsor-CAM, a single-pass gradient-based method that aggregates Grad-CAM maps from all convolutional layers and applies percentile-based Winsorization to attenuate outlier contributions. A user-controllable percentile parameter p enables semantic-level tuning, from low-level textures to high-level object patterns. We evaluate Winsor-CAM on six CNN architectures using PASCAL VOC 2012 and PolypGen, comparing localization (IoU, center-of-mass distance) and fidelity (insertion/deletion AUC) against seven baselines: Grad-CAM, Grad-CAM++, LayerCAM, ScoreCAM, AblationCAM, ShapleyCAM, and FullGrad. On DenseNet121 with a subset of PASCAL VOC 2012, Winsor-CAM achieves 46.8% IoU and 0.059 CoM distance versus 39.0% and 0.074 for Grad-CAM, with improved insertion AUC (0.656 vs. 0.623) and deletion AUC (0.197 vs. 0.242). Notably, even the worst-performing fixed p-value configuration outperforms FullGrad across all metrics. An ablation study confirms that incorporating earlier layers improves localization. A similar evaluation on PolypGen polyp segmentation further validates Winsor-CAM's effectiveness in medical imaging contexts. Winsor-CAM provides an efficient, robust, and human-tunable explanation tool for expert-in-the-loop analysis.
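The core aggregation step described above can be sketched in a few lines. This is a minimal illustration, not the paper's reference implementation: it assumes the per-layer Grad-CAM maps have already been computed, ReLU'd, and upsampled to a common spatial size, and it applies upper Winsorization at a user-chosen percentile p before per-layer normalization and averaging. The function names (`winsorize`, `winsor_cam`) are our own for this sketch.

```python
import numpy as np

def winsorize(x, p):
    """Upper Winsorization: clip values above the p-th percentile."""
    hi = np.percentile(x, p)
    return np.minimum(x, hi)

def winsor_cam(layer_maps, p=90.0):
    """Aggregate per-layer saliency maps with percentile Winsorization.

    layer_maps: list of 2D arrays (non-negative Grad-CAM-style maps,
                already upsampled to a common spatial size).
    p: user-tunable percentile controlling how strongly outlier
       activations are attenuated (the paper's tunable parameter).
    """
    agg = np.zeros_like(layer_maps[0], dtype=float)
    for m in layer_maps:
        m = winsorize(np.maximum(m, 0.0), p)  # suppress outlier peaks
        rng = m.max() - m.min()
        if rng > 0:                           # per-layer min-max normalization
            m = (m - m.min()) / rng
        agg += m
    return agg / len(layer_maps)              # uniform average over layers
```

Lower p clips more aggressively, spreading attention toward earlier-layer, texture-level evidence; p = 100 leaves each map unclipped and reduces the step to plain layer averaging.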
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Explainable AI Evaluation | PASCAL VOC 2012 (subset) | IoU 46.8% | 8 |