Self-Calibrated CLIP for Training-Free Open-Vocabulary Segmentation

About

Recent advancements in pre-trained vision-language models like CLIP have enabled the task of open-vocabulary segmentation. CLIP demonstrates impressive zero-shot capabilities in various downstream tasks that require holistic image understanding. However, due to its image-level contrastive learning and fully global feature interaction, ViT-based CLIP struggles to capture local details, resulting in poor performance on segmentation tasks. Our analysis of ViT-based CLIP reveals that anomaly tokens emerge during the forward pass, attracting disproportionate attention from normal patch tokens and thereby diminishing spatial awareness. To address this issue, we propose Self-Calibrated CLIP (SC-CLIP), a training-free method that calibrates CLIP to generate finer representations while preserving its original generalization ability, without introducing new parameters or relying on additional backbones. Specifically, we mitigate the negative impact of anomaly tokens from two complementary perspectives. First, we explicitly identify the anomaly tokens and replace them based on local context. Second, we reduce their influence on normal tokens by enhancing feature discriminability and attention correlation, leveraging the inherent semantic consistency within CLIP's mid-level features. In addition, we introduce a two-pass strategy that effectively integrates multi-level features to enrich local details under the training-free setting. Together, these strategies enhance CLIP's feature representations with improved granularity and semantic coherence. Experimental results demonstrate the effectiveness of SC-CLIP, achieving state-of-the-art results across all datasets and surpassing previous methods by 9.5%. Notably, SC-CLIP boosts the performance of vanilla CLIP ViT-L/14 by 6.8 times. Our source code is available at https://github.com/SuleBai/SC-CLIP.
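The abstract only sketches the calibration steps. As a rough illustration of the first one (identifying anomaly tokens and replacing them from local context), here is a minimal PyTorch-style sketch. The norm-based outlier test, the 3x3 neighbourhood average, and the function name replace_anomaly_tokens are assumptions made for illustration; they are not the authors' released implementation, which is available in the linked repository.

```python
# Hypothetical sketch, not the authors' code: flag anomaly patch tokens by an
# outlier L2-norm test and replace them with the mean of their non-anomalous
# 3x3 spatial neighbours (a simple "local context" estimate).
import torch
import torch.nn.functional as F


def replace_anomaly_tokens(patch_tokens: torch.Tensor,
                           grid_hw: tuple,
                           z_thresh: float = 3.0) -> torch.Tensor:
    """patch_tokens: (B, N, C) patch embeddings (CLS token excluded).
    grid_hw: (H, W) patch grid with H * W == N."""
    B, N, C = patch_tokens.shape
    H, W = grid_hw

    # 1. Flag anomaly tokens: patches whose feature norm is a strong outlier.
    norms = patch_tokens.norm(dim=-1)                                   # (B, N)
    z = (norms - norms.mean(dim=1, keepdim=True)) / (norms.std(dim=1, keepdim=True) + 1e-6)
    anomaly = z > z_thresh                                              # (B, N) bool

    # 2. Local-context replacement: 3x3 average over non-anomalous neighbours.
    feat = patch_tokens.transpose(1, 2).reshape(B, C, H, W)             # (B, C, H, W)
    keep = (~anomaly).float().reshape(B, 1, H, W)
    kernel = torch.ones(1, 1, 3, 3, device=feat.device)
    neigh_sum = F.conv2d(feat * keep, kernel.expand(C, 1, 3, 3), padding=1, groups=C)
    neigh_cnt = F.conv2d(keep, kernel, padding=1).clamp(min=1.0)
    local_mean = (neigh_sum / neigh_cnt).reshape(B, C, N).transpose(1, 2)  # (B, N, C)

    # 3. Swap anomalies for their local-context estimate; normal tokens untouched.
    return torch.where(anomaly.unsqueeze(-1), local_mean, patch_tokens)
```

In this sketch the replacement is purely spatial; the paper's second step (boosting feature discriminability and attention correlation via mid-level semantic consistency) and the two-pass multi-level fusion are separate mechanisms not shown here.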

Sule Bai, Yong Liu, Yifei Han, Haoji Zhang, Yansong Tang, Jie Zhou, Jiwen Lu • 2024

Related benchmarks

Task                  | Dataset           | Result (mIoU) | Rank
Semantic segmentation | ADE20K            | 21.7          | 1024
Semantic segmentation | Cityscapes        | 41.3          | 658
Semantic segmentation | COCO Stuff        | 27.25         | 379
Semantic segmentation | ADE20K            | 21.7          | 366
Semantic segmentation | Cityscapes        | 41.3          | 218
Semantic segmentation | Pascal Context    | 40.12         | 217
Semantic segmentation | Pascal Context 59 | 40.6          | 204
Semantic segmentation | PC-59             | 40.6          | 148
Semantic segmentation | Pascal Context 60 | 36.9          | 139
Semantic segmentation | Pascal VOC 20     | 88.3          | 130

(Showing 10 of 27 rows)
