ProxyCLIP: Proxy Attention Improves CLIP for Open-Vocabulary Segmentation

About

Open-vocabulary semantic segmentation requires models to effectively integrate visual representations with open-vocabulary semantic labels. While Contrastive Language-Image Pre-training (CLIP) models shine in recognizing visual concepts from text, they often struggle with segment coherence due to their limited localization ability. In contrast, Vision Foundation Models (VFMs) excel at acquiring spatially consistent local visual representations, yet they fall short in semantic understanding. This paper introduces ProxyCLIP, an innovative framework designed to harmonize the strengths of both CLIP and VFMs, facilitating enhanced open-vocabulary semantic segmentation. ProxyCLIP leverages the spatial feature correspondence from VFMs as a form of proxy attention to augment CLIP, thereby inheriting the VFMs' robust local consistency and maintaining CLIP's exceptional zero-shot transfer capacity. We propose an adaptive normalization and masking strategy to get the proxy attention from VFMs, allowing for adaptation across different VFMs. Remarkably, as a training-free approach, ProxyCLIP significantly improves the average mean Intersection over Union (mIoU) across eight benchmarks from 40.3 to 44.4, showcasing its exceptional efficacy in bridging the gap between spatial precision and semantic richness for the open-vocabulary segmentation task.
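The proxy-attention idea described above can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names, the per-row standardization, the `scale` parameter, and the rule of masking below-average similarities are all assumptions chosen to illustrate "adaptive normalization and masking" in one plausible form. Patch features from a VFM define which patches attend to each other, and that attention is then used to aggregate CLIP value features before matching them to text embeddings.

```python
import numpy as np

def proxy_attention(vfm_feats, scale=3.0, eps=1e-8):
    """Build an (N, N) attention map from VFM patch features of shape (N, D).

    Assumed recipe (illustrative only): cosine similarity, per-row
    standardization, masking of below-average entries, then softmax.
    """
    f = vfm_feats / (np.linalg.norm(vfm_feats, axis=1, keepdims=True) + eps)
    sim = f @ f.T
    # Per-row standardization so the scale adapts to each VFM's similarity range.
    mean = sim.mean(axis=1, keepdims=True)
    std = sim.std(axis=1, keepdims=True) + eps
    scores = scale * (sim - mean) / std
    # Mask patches that are less similar than average; the row maximum is
    # always >= the row mean, so every row keeps at least one finite entry.
    scores = np.where(scores < 0, -np.inf, scores)
    scores = scores - scores.max(axis=1, keepdims=True)
    weights = np.exp(scores)
    return weights / weights.sum(axis=1, keepdims=True)

def segment(vfm_feats, clip_values, text_embeds):
    """Aggregate CLIP value features with proxy attention and label each patch."""
    attn = proxy_attention(vfm_feats)
    patches = attn @ clip_values  # (N, C)
    patches = patches / (np.linalg.norm(patches, axis=1, keepdims=True) + 1e-8)
    text = text_embeds / (np.linalg.norm(text_embeds, axis=1, keepdims=True) + 1e-8)
    return (patches @ text.T).argmax(axis=1)  # (N,) class index per patch
```

Because the attention map is computed only from frozen features, a sketch like this needs no training, which matches the training-free setting the abstract describes. In a real pipeline `vfm_feats` and `clip_values` would come from the respective backbones at the same patch resolution.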

Mengcheng Lan, Chaofeng Chen, Yiping Ke, Xinjiang Wang, Litong Feng, Wayne Zhang • 2024

Related benchmarks

Task | Dataset | Result | Rank
Semantic segmentation | ADE20K (val) | mIoU 19.6 | 2731
Semantic segmentation | ADE20K | mIoU 24.2 | 936
Semantic segmentation | Cityscapes | mIoU 42 | 578
Semantic segmentation | Cityscapes (val) | mIoU 38.1 | 332
Semantic segmentation | COCO Stuff | mIoU 26.8 | 195
Semantic segmentation | Pascal VOC | mIoU 0.65 | 172
Semantic segmentation | Pascal Context 59 | mIoU 39.6 | 164
Semantic segmentation | LoveDA | mIoU 34.3 | 142
Semantic segmentation | COCO Stuff (val) | mIoU 26.2 | 126
Semantic segmentation | PASCAL-Context 59 class (val) | mIoU 38.8 | 125
Showing 10 of 69 rows
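
For readers unfamiliar with the metric reported above, here is a minimal sketch of how mean Intersection over Union is commonly computed from predicted and ground-truth label maps. The function name, the `ignore_index` default of 255, and the choice to skip classes absent from both prediction and ground truth are assumptions for illustration; individual benchmarks may differ in these details.

```python
import numpy as np

def mean_iou(pred, target, num_classes, ignore_index=255):
    """Mean IoU over classes; labels are integers in [0, num_classes)."""
    pred = np.asarray(pred).ravel()
    target = np.asarray(target).ravel()
    keep = target != ignore_index  # drop pixels marked as "ignore"
    pred, target = pred[keep], target[keep]
    # Confusion matrix: rows are ground-truth classes, columns are predictions.
    conf = np.bincount(
        target * num_classes + pred, minlength=num_classes ** 2
    ).reshape(num_classes, num_classes)
    inter = np.diag(conf)
    union = conf.sum(axis=0) + conf.sum(axis=1) - inter
    valid = union > 0  # skip classes that never appear
    if not valid.any():
        return float("nan")
    return float((inter[valid] / union[valid]).mean())
```

The function returns a value in [0, 1]; leaderboards commonly report it multiplied by 100. For a dataset-level score, the confusion matrix is typically accumulated over all images before the per-class ratios are taken.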
