
Explore the Potential of CLIP for Training-Free Open Vocabulary Semantic Segmentation

About

CLIP, as a vision-language model, has significantly advanced Open-Vocabulary Semantic Segmentation (OVSS) with its zero-shot capabilities. Despite this success, applying it to OVSS is challenging because its training aligns images and text only at the image level, which limits its performance on tasks requiring detailed local context. Our study examines the impact of CLIP's [CLS] token on patch feature correlations, revealing a dominance of "global" patches that hinders local feature discrimination. To overcome this, we propose CLIPtrase, a novel training-free semantic segmentation strategy that enhances local feature awareness through recalibrated self-correlation among patches. This approach yields notable improvements in segmentation accuracy while maintaining semantic coherence across objects. Experiments show that we are 22.3% ahead of CLIP on average across 9 segmentation benchmarks, outperforming existing state-of-the-art training-free methods. The code is publicly available at: https://github.com/leaves162/CLIPtrase.
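The core idea of the abstract, refining each CLIP patch feature by aggregating over its patch-to-patch self-correlation before matching against class-name text embeddings, can be sketched in a few lines. This is a minimal NumPy illustration, not the authors' implementation; the function name, the temperature value, and the simple softmax aggregation are assumptions for demonstration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_correlation_refine(patch_feats, text_feats, tau=0.07):
    """Hypothetical sketch: refine CLIP patch features via their own
    patch-to-patch correlation, then score them against text embeddings.

    patch_feats: (N, D) patch embeddings for N image patches
    text_feats:  (C, D) text embeddings for C class names
    Returns:     (N, C) per-patch class similarity scores
    """
    # L2-normalize so dot products are cosine similarities
    p = patch_feats / np.linalg.norm(patch_feats, axis=1, keepdims=True)
    t = text_feats / np.linalg.norm(text_feats, axis=1, keepdims=True)

    # Patch self-correlation: similarity of every patch to every other patch
    corr = softmax(p @ p.T / tau, axis=-1)           # (N, N)

    # Aggregate each patch's feature over semantically similar patches,
    # emphasizing local consistency rather than the global [CLS] signal
    refined = corr @ p                               # (N, D)
    refined /= np.linalg.norm(refined, axis=1, keepdims=True)

    # Cosine similarity of refined patch features to class text embeddings
    return refined @ t.T                             # (N, C)

# Toy example: 4 patches, 3 classes, 8-dim embeddings
rng = np.random.default_rng(0)
scores = self_correlation_refine(rng.normal(size=(4, 8)),
                                 rng.normal(size=(3, 8)))
print(scores.shape)
```

In practice the per-patch scores would be reshaped to the patch grid and upsampled to produce a segmentation map; the argmax over classes gives each patch's predicted label.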

Tong Shao, Zhuotao Tian, Hang Zhao, Jingyong Su • 2024

Related benchmarks

Task                  | Dataset                       | Result     | Rank
Semantic segmentation | ADE20K (val)                  | mIoU 16.4  | 2731
Semantic segmentation | ADE20K                        | mIoU 17    | 936
Semantic segmentation | Cityscapes                    | --         | 578
Semantic segmentation | Cityscapes (val)              | mIoU 21.3  | 332
Semantic segmentation | COCO Stuff                    | mIoU 0.241 | 195
Semantic segmentation | Pascal Context 59             | mIoU 34.9  | 164
Semantic segmentation | COCO Stuff (val)              | mIoU 22.8  | 126
Semantic segmentation | PASCAL-Context 59 class (val) | mIoU 33.8  | 125
Semantic segmentation | Pascal VOC 20                 | mIoU 81.2  | 105
Semantic segmentation | Pascal VOC 21 classes (val)   | mIoU 53    | 103
Showing 10 of 38 rows
