Plug-in Feedback Self-adaptive Attention in CLIP for Training-free Open-Vocabulary Segmentation

About

CLIP exhibits strong visual-textual alignment but struggles with open-vocabulary segmentation due to poor localization. Prior methods enhance spatial coherence by modifying intermediate attention. However, this coherence is not consistently propagated to the final output because of subsequent operations such as projections. Additionally, intermediate attention lacks direct interaction with text representations, and this semantic discrepancy limits the full potential of CLIP. In this work, we propose a training-free, feedback-driven self-adaptive framework that adapts output-based patch-level correspondences back to the intermediate attention. The output predictions, being the culmination of the model's processing, encapsulate the most comprehensive visual and textual semantics about each patch. Our approach enhances semantic consistency between internal representations and final predictions by leveraging the model's outputs as a stronger spatial coherence prior. We design key modules, including attention isolation, confidence-based pruning for sparse adaptation, and adaptation ensemble, to effectively feed back the output coherence cues. Our method functions as a plug-in module, seamlessly integrating into four state-of-the-art approaches with three backbones (ViT-B, ViT-L, ViT-H). We further validate our framework across multiple attention types (Q-K, self-self, and Proxy augmented with MAE, SAM, and DINO). Our approach consistently improves their performance across eight benchmarks.

Zhixiang Chi, Yanan Wu, Li Gu, Huan Liu, Ziqiang Wang, Yang Zhang, Yang Wang, Konstantinos N. Plataniotis • 2025
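
The feedback idea in the abstract (turning output-level patch predictions into a patch-to-patch attention prior) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function name `feedback_attention`, the parameters `keep_ratio` and `temperature`, the use of class-probability dot products as the correspondence measure, and the specific pruning rule are all assumptions made for illustration. The attention isolation and adaptation ensemble modules are not reproduced here.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax; entries set to -inf receive weight 0."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def feedback_attention(patch_feats, text_feats, keep_ratio=0.5, temperature=0.1):
    """Hypothetical sketch: derive patch-to-patch attention from output predictions.

    patch_feats: (N, D) final-layer patch embeddings (assumed input)
    text_feats:  (C, D) text embeddings, one per class name (assumed input)
    Returns an (N, N) row-stochastic attention matrix.
    """
    # Cosine similarity between every patch and every class text embedding.
    p = patch_feats / np.linalg.norm(patch_feats, axis=1, keepdims=True)
    t = text_feats / np.linalg.norm(text_feats, axis=1, keepdims=True)
    probs = softmax(p @ t.T / temperature, axis=1)  # (N, C) per-patch class probabilities

    # Output-based correspondence: patches with similar predictions get high affinity.
    affinity = probs @ probs.T  # (N, N)

    # Assumed pruning rule: only the most confident patches may serve as keys.
    confidence = probs.max(axis=1)
    n_keep = max(1, int(round(keep_ratio * confidence.shape[0])))
    keep = np.argsort(-confidence)[:n_keep]
    mask = np.full_like(affinity, -np.inf)
    mask[:, keep] = 0.0

    # Sparse attention: each row is normalized over the kept columns only.
    return softmax(affinity / temperature + mask, axis=1)
```

In a plug-in setting, a matrix like this would stand in for (or be combined with) the intermediate attention of a chosen layer; how that combination is done in the paper is not specified on this page.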

Related benchmarks

| Task | Dataset | Result (mIoU) | Rank |
| --- | --- | --- | --- |
| Semantic segmentation | ADE20K | 24.5 | 559 |
| Semantic segmentation | Cityscapes | 43.6 | 494 |
| Open Vocabulary Semantic Segmentation | COCO Stuff without background | 43.3 | 90 |
| Open Vocabulary Semantic Segmentation | COCO Object with background | 43.3 | 87 |
| Open Vocabulary Semantic Segmentation | Cityscapes | 38.8 | 81 |
| Open Vocabulary Semantic Segmentation | ADE20K | 20.5 | 80 |
| Semantic segmentation | PASCAL VOC with background category VOC21 2012 | 67.9 | 51 |
| Semantic segmentation | Pascal Context 60 with background | 40.2 | 43 |
| Semantic segmentation | COCO-Stuff without background class | 40.5 | 42 |
| Semantic segmentation | Pascal VOC without background 2012 V20 | 85.7 | 42 |
Showing 10 of 16 rows
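
All results above are reported as mIoU (mean intersection over union). As a reminder of what that metric measures, here is a minimal sketch for a single prediction/label pair. The function name `mean_iou` and the `ignore_index=255` convention are assumptions; benchmark toolkits typically accumulate the confusion matrix over the whole dataset before averaging, and may handle absent classes differently.

```python
import numpy as np


def mean_iou(pred, target, num_classes, ignore_index=255):
    """Mean IoU over classes that appear in the prediction or the label.

    pred, target: integer arrays of identical shape holding class ids.
    Returns a fraction in [0, 1]; multiply by 100 to compare with the table.
    """
    pred = np.asarray(pred)
    target = np.asarray(target)

    # Drop pixels marked as "ignore" in the ground truth (assumed convention).
    valid = target != ignore_index
    pred, target = pred[valid], target[valid]

    # Confusion matrix: rows are ground-truth classes, columns are predictions.
    cm = np.bincount(
        target * num_classes + pred, minlength=num_classes ** 2
    ).reshape(num_classes, num_classes)

    intersection = np.diag(cm)
    union = cm.sum(axis=0) + cm.sum(axis=1) - intersection

    # Average only over classes with a non-empty union.
    present = union > 0
    return float((intersection[present] / union[present]).mean())
```

For example, with `pred = [0, 0, 1, 1]` and `target = [0, 1, 1, 1]`, class 0 has IoU 1/2 and class 1 has IoU 2/3, so the mean is 7/12.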
