
ClearCLIP: Decomposing CLIP Representations for Dense Vision-Language Inference

About

Despite the success of large-scale pretrained Vision-Language Models (VLMs), particularly CLIP, on various open-vocabulary tasks, their application to semantic segmentation remains challenging, producing noisy segmentation maps with mis-segmented regions. In this paper, we carefully re-investigate the architecture of CLIP and identify residual connections as the primary source of noise that degrades segmentation quality. Through a comparative analysis of the statistical properties of the residual connection and the attention output across different pretrained models, we discover that CLIP's image-text contrastive training paradigm emphasizes global features at the expense of local discriminability, leading to noisy segmentation results. In response, we propose ClearCLIP, a novel approach that decomposes CLIP's representations to enhance open-vocabulary semantic segmentation. We introduce three simple modifications to the final layer: removing the residual connection, adopting self-self attention, and discarding the feed-forward network. ClearCLIP consistently generates clearer and more accurate segmentation maps and outperforms existing approaches across multiple benchmarks, affirming the significance of our discoveries.
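The three modifications above all target the final transformer block. As a rough illustration, the sketch below contrasts a standard block with the modified one: no residual stream, no feed-forward network, and attention computed from self-self (here query-query) similarity rather than query-key. This is a minimal NumPy approximation for intuition only; function and weight names are hypothetical, and the exact self-self variant (q-q, k-k, or v-v) follows the paper's ablation, not this sketch.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def standard_final_block(x, Wq, Wk, Wv, Wo, ffn, scale):
    """Vanilla ViT-style final block (residual + q-k attention + FFN)."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    attn_out = softmax((q @ k.T) * scale) @ v @ Wo
    h = x + attn_out          # residual connection around attention
    return h + ffn(h)         # residual connection around FFN

def clearclip_final_block(x, Wq, Wv, Wo, scale):
    """ClearCLIP-style final block (sketch):
    - residual connection removed
    - self-self (q-q) attention instead of q-k
    - feed-forward network discarded
    """
    q, v = x @ Wq, x @ Wv
    attn = softmax((q @ q.T) * scale)   # q-q similarity: one self-self variant
    return (attn @ v) @ Wo              # no residual add, no FFN
```

The key design point is that dropping the residual stream removes the globally-oriented features that the analysis identifies as the noise source, while self-self attention keeps each patch's output anchored to locally similar patches, which is what dense prediction needs.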

Mengcheng Lan, Chaofeng Chen, Yiping Ke, Xinjiang Wang, Litong Feng, Wayne Zhang • 2024

Related benchmarks

Task                    Dataset                  Result (mIoU)   Rank
Semantic segmentation   ADE20K (val)             16.7            2731
Semantic segmentation   ADE20K                   18.9            936
Semantic segmentation   Cityscapes               3.18e+3         578
Semantic segmentation   Cityscapes (val)         18.3            572
Semantic segmentation   PASCAL VOC (val)         80.9            338
Semantic segmentation   Cityscapes (val)         30              332
Semantic segmentation   PASCAL Context (val)     34.9            323
Semantic segmentation   COCO Stuff               23.9            195
Semantic segmentation   ADE20K A-150             17.7            188
Semantic segmentation   Pascal Context 59        35.9            164
Showing 10 of 73 rows
