
A Closer Look at the Explainability of Contrastive Language-Image Pre-training

About

Contrastive language-image pre-training (CLIP) is a powerful vision-language model that has shown great benefits for various tasks. However, we have identified some issues with its explainability, which undermine its credibility and limit its capacity for related tasks. Specifically, we find that CLIP tends to focus on background regions rather than foregrounds, with noisy activations at irrelevant positions in the visualization results. These phenomena conflict with conventional explainability methods based on the class activation map (CAM), where the raw model can highlight the local foreground regions using global supervision without alignment. To address these problems, we take a closer look at CLIP's architecture and features. Based on thorough analyses, we find that the raw self-attention maps link to inconsistent semantic regions, resulting in the opposite visualization. Besides, the noisy activations are due to redundant features shared among categories. Building on these insights, we propose CLIP Surgery for reliable CAM, a method that applies surgery-like modifications to the inference architecture and features, without further fine-tuning, as in classical CAM methods. This approach significantly improves the explainability of CLIP, surpassing existing methods by large margins. Besides, it enables multimodal visualization and extends the capacity of raw CLIP to open-vocabulary tasks without extra alignment. The code is available at https://github.com/xmed-lab/CLIP_Surgery.
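The two "surgeries" in the abstract can be sketched roughly as follows: replacing query-key self-attention with value-value similarity at inference (architecture surgery), and subtracting activations shared across all categories before scoring (feature surgery). This is a minimal single-head NumPy sketch under our own simplifying assumptions; function names, shapes, and the exact mean-subtraction form are ours, not the released implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def vv_attention(x, w_v, b_v):
    """Architecture-surgery sketch: attention weights come from
    value-value similarity instead of query-key similarity, so each
    token attends to semantically consistent regions.
    x: (N, C) token features; w_v, b_v: value projection."""
    v = x @ w_v.T + b_v                      # value projection, (N, C)
    scale = v.shape[-1] ** -0.5
    attn = softmax(v @ v.T * scale)          # (N, N) similarity weights
    return attn @ v                          # (N, C) re-weighted features

def feature_surgery(img_feats, txt_feats):
    """Feature-surgery sketch: per-channel image-text products are
    computed for every class, the class-mean (redundant, shared)
    component is removed, and channels are summed into scores.
    img_feats: (N, C) patch features; txt_feats: (K, C) class embeddings."""
    prod = img_feats[:, None, :] * txt_feats[None, :, :]  # (N, K, C)
    redundant = prod.mean(axis=1, keepdims=True)          # shared across classes
    return (prod - redundant).sum(axis=-1)                # (N, K) similarity map
```

Because the class-mean is subtracted, activations that fire identically for every category (the "noisy" background responses) cancel out, leaving class-discriminative regions.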

Yi Li, Hualiang Wang, Yiqun Duan, Jiheng Zhang, Xiaomeng Li · 2023

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Semantic segmentation | ADE20K | mIoU | 16.1 | 936 |
| Semantic segmentation | Cityscapes | mIoU | 31.4 | 578 |
| Semantic segmentation | COCO Stuff | mIoU | 2.97e+3 | 195 |
| Semantic segmentation | ADE20K A-150 | mIoU | 31.4 | 188 |
| Camouflaged Object Detection | COD10K (test) | S-measure (S_alpha) | 0.601 | 174 |
| Semantic segmentation | Pascal Context 59 | mIoU | 29.3 | 164 |
| Multi-Label Classification | NUS-WIDE (test) | mAP | 40.53 | 112 |
| Semantic segmentation | Pascal Context | mIoU | 30.5 | 111 |
| Semantic segmentation | Potsdam (test) | mIoU | 30.2 | 104 |
| Semantic segmentation | COCO | mIoU | 25.2 | 96 |

Showing 10 of 46 rows
