ClearCLIP: Decomposing CLIP Representations for Dense Vision-Language Inference
About
Despite the success of large-scale pretrained Vision-Language Models (VLMs), especially CLIP, in various open-vocabulary tasks, their application to semantic segmentation remains challenging, producing noisy segmentation maps with mis-segmented regions. In this paper, we carefully re-investigate the architecture of CLIP and identify residual connections as the primary source of the noise that degrades segmentation quality. Through a comparative analysis of the statistical properties of the residual connection and the attention output across different pretrained models, we find that CLIP's image-text contrastive training paradigm emphasizes global features at the expense of local discriminability, leading to noisy segmentation results. In response, we propose ClearCLIP, a novel approach that decomposes CLIP's representations to enhance open-vocabulary semantic segmentation. We introduce three simple modifications to the final layer: removing the residual connection, adopting self-self attention, and discarding the feed-forward network. ClearCLIP consistently generates clearer and more accurate segmentation maps and outperforms existing approaches across multiple benchmarks, affirming the significance of our findings.
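The three modifications to the final layer can be illustrated with a minimal NumPy sketch. This is an assumption-laden illustration, not the official implementation: it uses single-head attention, hypothetical weight names (`w_q`, `w_v`, `w_o`), and "self-self" attention in its query-query (q-q) form, so the key projection is not needed at all.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def clearclip_final_block(x, w_q, w_v, w_o):
    """Sketch of ClearCLIP's modified final layer (single head, hypothetical shapes).

    A standard CLIP block computes: x + Attn(q, k, v), then x + FFN(x).
    The sketch below instead applies q-q (self-self) attention and returns the
    attention output directly: no residual addition and no feed-forward network.
    """
    q = x @ w_q                              # query projection
    v = x @ w_v                              # value projection
    d = q.shape[-1]
    attn = softmax(q @ q.T / np.sqrt(d))     # self-self (q-q) attention map
    return attn @ v @ w_o                    # no residual, no FFN

# Toy usage: 4 tokens with dimension 8
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))
w = [rng.standard_normal((8, 8)) for _ in range(3)]
out = clearclip_final_block(x, *w)
print(out.shape)  # (4, 8)
```

Dense prediction then comes from comparing each token's output with the text embeddings, rather than from the global pooled feature.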
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Semantic segmentation | ADE20K (val) | mIoU | 16.7 | 2888 |
| Semantic segmentation | PASCAL VOC 2012 (val) | Mean IoU | 51.8 | 2142 |
| Semantic segmentation | ADE20K | mIoU | 18.9 | 1024 |
| Semantic segmentation | Cityscapes | mIoU | 3.18e+3 | 658 |
| Semantic segmentation | Cityscapes (val) | mIoU | 18.3 | 572 |
| Semantic segmentation | COCO Stuff | mIoU | 23.9 | 379 |
| Semantic segmentation | Cityscapes (val) | mIoU | 30 | 374 |
| Semantic segmentation | ADE20K | mIoU | 16.9 | 366 |
| Semantic segmentation | PASCAL VOC (val) | mIoU | 80.9 | 362 |
| Semantic segmentation | PASCAL Context (val) | mIoU | 34.9 | 360 |