Self-Calibrated CLIP for Training-Free Open-Vocabulary Segmentation

About

Recent advancements in pre-trained vision-language models like CLIP have enabled the task of open-vocabulary segmentation. CLIP demonstrates impressive zero-shot capabilities in various downstream tasks that require holistic image understanding. However, due to its image-level contrastive learning and fully global feature interaction, ViT-based CLIP struggles to capture local details, resulting in poor performance on segmentation tasks. Our analysis of ViT-based CLIP reveals that anomaly tokens emerge during the forward pass, attracting disproportionate attention from normal patch tokens and thereby diminishing spatial awareness. To address this issue, we propose Self-Calibrated CLIP (SC-CLIP), a training-free method that calibrates CLIP to generate finer representations while preserving its original generalization ability, without introducing new parameters or relying on additional backbones. Specifically, we mitigate the negative impact of anomaly tokens from two complementary perspectives. First, we explicitly identify the anomaly tokens and replace them based on local context. Second, we reduce their influence on normal tokens by enhancing feature discriminability and attention correlation, leveraging the inherent semantic consistency within CLIP's mid-level features. In addition, we introduce a two-pass strategy that effectively integrates multi-level features to enrich local details under the training-free setting. Together, these strategies enhance CLIP's feature representations with improved granularity and semantic coherence. Experimental results demonstrate the effectiveness of SC-CLIP, achieving state-of-the-art results across all datasets and surpassing previous methods by 9.5%. Notably, SC-CLIP boosts the performance of vanilla CLIP ViT-L/14 by 6.8 times. Our source code is available at https://github.com/SuleBai/SC-CLIP.
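The abstract only sketches the calibration steps. As a rough illustration of the first one (identifying anomaly tokens and replacing them from local context), here is a minimal PyTorch-style sketch. The norm-based outlier test, the 3x3 neighbourhood average, and the function name replace_anomaly_tokens are assumptions made for illustration; they are not the authors' released implementation, which is available in the linked repository.

```python
# Hypothetical sketch, not the authors' code: flag anomaly patch tokens by an
# outlier L2-norm test and replace them with the mean of their non-anomalous
# 3x3 spatial neighbours (a simple "local context" estimate).
import torch
import torch.nn.functional as F


def replace_anomaly_tokens(patch_tokens: torch.Tensor,
                           grid_hw: tuple,
                           z_thresh: float = 3.0) -> torch.Tensor:
    """patch_tokens: (B, N, C) patch embeddings (CLS token excluded).
    grid_hw: (H, W) patch grid with H * W == N."""
    B, N, C = patch_tokens.shape
    H, W = grid_hw

    # 1. Flag anomaly tokens: patches whose feature norm is a strong outlier.
    norms = patch_tokens.norm(dim=-1)                                   # (B, N)
    z = (norms - norms.mean(dim=1, keepdim=True)) / (norms.std(dim=1, keepdim=True) + 1e-6)
    anomaly = z > z_thresh                                              # (B, N) bool

    # 2. Local-context replacement: 3x3 average over non-anomalous neighbours.
    feat = patch_tokens.transpose(1, 2).reshape(B, C, H, W)             # (B, C, H, W)
    keep = (~anomaly).float().reshape(B, 1, H, W)
    kernel = torch.ones(1, 1, 3, 3, device=feat.device)
    neigh_sum = F.conv2d(feat * keep, kernel.expand(C, 1, 3, 3), padding=1, groups=C)
    neigh_cnt = F.conv2d(keep, kernel, padding=1).clamp(min=1.0)
    local_mean = (neigh_sum / neigh_cnt).reshape(B, C, N).transpose(1, 2)  # (B, N, C)

    # 3. Swap anomalies for their local-context estimate; normal tokens untouched.
    return torch.where(anomaly.unsqueeze(-1), local_mean, patch_tokens)
```

In this sketch the replacement is purely spatial; the paper's second step (boosting feature discriminability and attention correlation via mid-level semantic consistency) and the two-pass multi-level fusion are separate mechanisms not shown here.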

Sule Bai, Yong Liu, Yifei Han, Haoji Zhang, Yansong Tang, Jie Zhou, Jiwen Lu • 2024

Related benchmarks

Task                  | Dataset           | Result (mIoU) | Rank
Semantic segmentation | ADE20K            | 21.7          | 1024
Semantic segmentation | Cityscapes        | 41.3          | 658
Semantic segmentation | COCO Stuff        | 27.25         | 379
Semantic segmentation | ADE20K            | 21.7          | 366
Semantic segmentation | Cityscapes        | 41.3          | 218
Semantic segmentation | Pascal Context    | 40.12         | 217
Semantic segmentation | Pascal Context 59 | 40.6          | 204
Semantic segmentation | PC-59             | 40.6          | 148
Semantic segmentation | Pascal Context 60 | 36.9          | 139
Semantic segmentation | Pascal VOC 20     | 88.3          | 130

(Showing 10 of 27 rows)
