
Self-Calibrated CLIP for Training-Free Open-Vocabulary Segmentation

About

Recent advancements in pre-trained vision-language models like CLIP have enabled the task of open-vocabulary segmentation. CLIP demonstrates impressive zero-shot capabilities in various downstream tasks that require holistic image understanding. However, due to its image-level pre-training, CLIP struggles to capture local details, resulting in poor performance on segmentation tasks. Our analysis reveals that anomaly tokens emerge during the forward pass, drawing excessive attention from normal patch tokens and thereby diminishing spatial awareness. To address this issue, we propose Self-Calibrated CLIP (SC-CLIP), a training-free method that calibrates CLIP to produce finer representations while preserving its original generalization ability, without introducing new parameters or relying on additional backbones. Specifically, we first identify and resolve the anomaly tokens to mitigate their negative impact. Next, we enhance feature discriminability and attention correlation by leveraging the semantic consistency found in CLIP's intermediate features. Furthermore, we explore how to effectively employ multi-level feature fusion in the training-free setting. Collectively, these strategies enhance CLIP's feature representation with greater granularity and coherence. Experimental results demonstrate the effectiveness of SC-CLIP, achieving state-of-the-art results across all datasets and surpassing previous methods by 9.5%. Notably, SC-CLIP boosts the performance of vanilla CLIP ViT-L/14 by 6.8 times. Our source code is available at https://github.com/SuleBai/SC-CLIP.
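The first step described above, identifying and resolving anomaly tokens, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the outlier rule (flagging patch tokens whose L2 norm exceeds the mean by a few standard deviations) and the neighbour-averaging repair are illustrative assumptions, standing in for whatever detection and interpolation scheme SC-CLIP actually uses.

```python
import numpy as np

def resolve_anomaly_tokens(tokens, grid, k=3.0):
    """Flag patch tokens whose L2 norm is a statistical outlier (a
    hypothetical proxy for SC-CLIP's anomaly tokens) and replace each
    with the mean of its non-anomalous spatial neighbours.

    tokens: (H*W, D) array of patch tokens; grid: (H, W) patch layout.
    Returns the repaired tokens and the boolean anomaly mask.
    """
    H, W = grid
    norms = np.linalg.norm(tokens, axis=1)
    # Outlier rule: norm > mean + k * std (an assumption, not the paper's test).
    mask = norms > norms.mean() + k * norms.std()
    out = tokens.copy()
    feat = tokens.reshape(H, W, -1)
    for idx in np.flatnonzero(mask):
        r, c = divmod(idx, W)
        # Gather the 8-connected neighbours that are not themselves anomalous.
        neigh = [feat[rr, cc]
                 for rr in range(max(r - 1, 0), min(r + 2, H))
                 for cc in range(max(c - 1, 0), min(c + 2, W))
                 if (rr, cc) != (r, c) and not mask[rr * W + cc]]
        if neigh:  # fall back to the original token if no clean neighbour exists
            out[idx] = np.mean(neigh, axis=0)
    return out, mask
```

In a real pipeline the tokens would come from an intermediate CLIP ViT layer; here the point is only that the repair is parameter-free, consistent with the training-free setting.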

Sule Bai, Yong Liu, Yifei Han, Haoji Zhang, Yansong Tang • 2024

Related benchmarks

Task                                   Dataset                       Metric  Result  Rank
Semantic segmentation                  ADE20K                        mIoU    21.7    936
Semantic segmentation                  Cityscapes                    mIoU    41.3    578
Semantic segmentation                  COCO Stuff                    mIoU    26.9    195
Semantic segmentation                  Pascal Context 59             mIoU    40.6    164
Semantic segmentation                  Pascal VOC 20                 mIoU    88.3    105
Semantic segmentation                  Pascal VOC 21 classes (val)   mIoU    65.0    103
Semantic segmentation                  Pascal Context 60             mIoU    36.9    81
Semantic segmentation                  COCO Object                   mIoU    40.5    73
Open-Vocabulary Semantic Segmentation  COCOStuff (val)               mIoU    26.9    60
Open-Vocabulary Segmentation           Cityscapes                    mIoU    41.3    49

(Showing 10 of 11 rows)
