Rethinking the Global Knowledge of CLIP in Training-Free Open-Vocabulary Semantic Segmentation

About

Recent works modify CLIP to perform open-vocabulary semantic segmentation in a training-free manner (TF-OVSS). In vanilla CLIP, patch-wise image representations mainly encode homogeneous image-level properties, which hinders the application of CLIP to dense prediction tasks. Previous TF-OVSS works sacrifice globality to enhance the locality of CLIP features by making each patch attend mainly to itself or to its neighboring patches within a narrow local window. These modifications largely weaken the ability of CLIP to aggregate global context information. In contrast, in this paper we rethink the global knowledge encoded by CLIP and propose GCLIP to answer how to extract and utilize the beneficial global knowledge of CLIP for TF-OVSS. As the representation of each patch is ultimately determined by the attention weights and the Value embeddings, we propose to reshape the last-block attention and Value embeddings to aggregate useful global context into the final features. First, we aim to equip the last-block attention with image-level properties without introducing homogeneous attention patterns across patches. To this end, we fuse the attention from the global-token emerging blocks with the Query-Query attention. Second, we aim to make the Value embeddings of the last-block attention module more semantically correlated. To this end, we design a novel channel suppression strategy. Extensive experiments on five standard benchmarks demonstrate that our method consistently outperforms previous state-of-the-art methods.
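
The attention-reshaping idea in the abstract can be illustrated with a minimal sketch. The abstract does not state the fusion rule, so the convex combination below, the weight `alpha`, the uniform stand-in for the attention taken from an earlier block, and all function names are assumptions made for illustration, not the authors' implementation. The channel suppression step is not sketched because its details are not given here. Plain Python lists are used so the example runs without dependencies.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def query_query_attention(queries):
    """Query-Query attention: softmax(Q Q^T / sqrt(d)), one row per patch."""
    d = len(queries[0])
    scale = 1.0 / math.sqrt(d)
    return [
        softmax([scale * sum(a * b for a, b in zip(qi, qj)) for qj in queries])
        for qi in queries
    ]


def fuse_attention(global_attn, qq_attn, alpha=0.5):
    """Assumed fusion rule: convex combination of two row-stochastic maps."""
    return [
        [alpha * g + (1.0 - alpha) * q for g, q in zip(g_row, q_row)]
        for g_row, q_row in zip(global_attn, qq_attn)
    ]


def apply_attention(attn, values):
    """Aggregate Value embeddings: out[i] = sum_j attn[i][j] * values[j]."""
    dim = len(values[0])
    return [
        [sum(w * v[c] for w, v in zip(row, values)) for c in range(dim)]
        for row in attn
    ]


# Toy input: 3 patches with 2-dimensional Query and Value embeddings.
queries = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

# Hypothetical stand-in for attention from an earlier block (uniform here).
global_attn = [[1.0 / 3] * 3 for _ in range(3)]

qq_attn = query_query_attention(queries)
fused = fuse_attention(global_attn, qq_attn, alpha=0.3)
features = apply_attention(fused, values)
```

Because both inputs are row-stochastic and the weights sum to one, each fused row still sums to one, so the output for every patch remains a weighted average of the Value embeddings.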

Jingyun Wang, Cilin Yan, Guoliang Kang • 2025

Related benchmarks

Task                                    Dataset                                              Result           Rank
Open Vocabulary Semantic Segmentation   Cityscapes                                           mIoU 33.7        81
Open Vocabulary Semantic Segmentation   ADE20K                                               mIoU 18.5        80
Open Vocabulary Semantic Segmentation   COCO Stuff                                           mIoU 24.8        48
Open-Vocabulary Segmentation            Pascal Context                                       mIoU 36.8        33
Open Vocabulary Semantic Segmentation   Pascal VOC                                           mIoU 81.3        27
Open Vocabulary Semantic Segmentation   Aggregate VOC, Context, ADE, Cityscapes, COCO Stuff  Average mIoU 39  11
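
The aggregate figure appears consistent with an unweighted mean of the five per-dataset mIoU scores listed here. The short check below assumes that is how the aggregate is computed; the listing itself does not say so.

```python
# Per-dataset mIoU values as listed above.
scores = {
    "Cityscapes": 33.7,
    "ADE20K": 18.5,
    "COCO Stuff": 24.8,
    "Pascal Context": 36.8,
    "Pascal VOC": 81.3,
}

# Unweighted mean across datasets (assumed aggregation rule).
average = sum(scores.values()) / len(scores)
print(round(average, 1))  # prints 39.0
```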