ClearCLIP: Decomposing CLIP Representations for Dense Vision-Language Inference
About
Despite the success of large-scale pretrained Vision-Language Models (VLMs), especially CLIP, in various open-vocabulary tasks, their application to semantic segmentation remains challenging, producing noisy segmentation maps with mis-segmented regions. In this paper, we carefully re-investigate the architecture of CLIP and identify residual connections as the primary source of the noise that degrades segmentation quality. Through a comparative analysis of the statistical properties of the residual connection and the attention output across different pretrained models, we find that CLIP's image-text contrastive training paradigm emphasizes global features at the expense of local discriminability, leading to noisy segmentation results. In response, we propose ClearCLIP, a novel approach that decomposes CLIP's representations to enhance open-vocabulary semantic segmentation. We introduce three simple modifications to the final layer: removing the residual connection, adopting self-self attention, and discarding the feed-forward network. ClearCLIP consistently generates clearer and more accurate segmentation maps and outperforms existing approaches across multiple benchmarks, affirming the significance of our findings.
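The three modifications to the final layer can be illustrated with a minimal NumPy sketch. This is an assumption-laden illustration, not the official implementation: it uses single-head attention, hypothetical weight names (`w_q`, `w_v`, `w_o`), and "self-self" attention in its query-query (q-q) form, so the key projection is not needed at all.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def clearclip_final_block(x, w_q, w_v, w_o):
    """Sketch of ClearCLIP's modified final layer (single head, hypothetical shapes).

    A standard CLIP block computes: x + Attn(q, k, v), then x + FFN(x).
    The sketch below instead applies q-q (self-self) attention and returns the
    attention output directly: no residual addition and no feed-forward network.
    """
    q = x @ w_q                              # query projection
    v = x @ w_v                              # value projection
    d = q.shape[-1]
    attn = softmax(q @ q.T / np.sqrt(d))     # self-self (q-q) attention map
    return attn @ v @ w_o                    # no residual, no FFN

# Toy usage: 4 tokens with dimension 8
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))
w = [rng.standard_normal((8, 8)) for _ in range(3)]
out = clearclip_final_block(x, *w)
print(out.shape)  # (4, 8)
```

Dense prediction then comes from comparing each token's output with the text embeddings, rather than from the global pooled feature.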
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Semantic segmentation | ADE20K (val) | mIoU | 16.7 | 2888 |
| Semantic segmentation | PASCAL VOC 2012 (val) | Mean IoU | 51.8 | 2142 |
| Semantic segmentation | ADE20K | mIoU | 18.9 | 1024 |
| Semantic segmentation | Cityscapes | mIoU | 3.18e+3 | 658 |
| Semantic segmentation | Cityscapes (val) | mIoU | 18.3 | 572 |
| Semantic segmentation | COCO Stuff | mIoU | 23.9 | 379 |
| Semantic segmentation | Cityscapes (val) | mIoU | 30 | 374 |
| Semantic segmentation | ADE20K | mIoU | 16.9 | 366 |
| Semantic segmentation | PASCAL VOC (val) | mIoU | 80.9 | 362 |
| Semantic segmentation | PASCAL Context (val) | mIoU | 34.9 | 360 |