VIP: Visual-guided Prompt Evolution for Efficient Dense Vision-Language Inference

About

Pursuing training-free open-vocabulary semantic segmentation in an efficient and generalizable manner remains challenging due to the deep-seated spatial bias in CLIP. To overcome the limitations of existing solutions, this work moves beyond the CLIP-based paradigm and harnesses the recent spatially-aware dino.txt framework to facilitate more efficient and high-quality dense prediction. While dino.txt exhibits robust spatial awareness, we find that the semantic ambiguity of text queries gives rise to severe mismatch within its dense cross-modal interactions. To address this, we introduce Visual-guided Prompt Evolution (VIP) to rectify the semantic expressiveness of text queries in dino.txt, unleashing its potential for fine-grained object perception. To this end, VIP integrates alias expansion with a visual-guided distillation mechanism to mine valuable semantic cues, which are robustly aggregated in a saliency-aware manner to yield high-fidelity predictions. Extensive evaluations demonstrate that VIP (1) surpasses leading methods by 1.4%-8.4% average mIoU, (2) generalizes well to diverse challenging domains, and (3) requires only marginal inference time and memory overhead.
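The dense cross-modal matching and alias expansion mentioned above can be illustrated with a toy sketch. This is not the authors' implementation: the function names, the 2-D embeddings, and the max-pooling over aliases are all assumptions made for illustration, and the visual-guided distillation and saliency-aware aggregation steps are not modeled.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def segment_patches(patch_feats, alias_embeds):
    """Assign each patch the class whose best-matching alias scores highest.

    patch_feats:  list of patch feature vectors (hypothetical encoder output)
    alias_embeds: dict mapping class name -> list of alias text embeddings
    Max-pooling over aliases is an illustrative choice, not VIP's aggregation.
    """
    labels = []
    for feat in patch_feats:
        scores = {
            cls: max(cosine(feat, emb) for emb in embeds)
            for cls, embeds in alias_embeds.items()
        }
        labels.append(max(scores, key=scores.get))
    return labels

# Toy 2-D embeddings standing in for real image and text features.
patches = [[1.0, 0.1], [0.2, 1.0], [0.9, 0.3]]
aliases = {
    "sofa": [[1.0, 0.0], [0.9, 0.2]],  # e.g. embeddings for "sofa" and "couch"
    "plant": [[0.0, 1.0]],
}
labels = segment_patches(patches, aliases)
```

In this toy setup the third patch matches the second "sofa" alias more closely than the first, which is the kind of ambiguity that expanding a class name into several aliases is meant to absorb.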

Hao Zhu, Shuo Jin, Wenbin Liao, Jiayu Xiao, Yan Zhu, Siyue Yu, Feng Dai • 2026

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Semantic segmentation | Vaihingen | mIoU 47 | 156 |
| Semantic segmentation | iSAID | mIoU 26.1 | 146 |
| Semantic segmentation | Potsdam | mIoU 49.8 | 110 |
| Open Vocabulary Semantic Segmentation | COCO Stuff without background | mIoU 52.4 | 90 |
| Open Vocabulary Semantic Segmentation | COCO Object with background | mIoU 52.4 | 87 |
| Semantic segmentation | VDD | mIoU 54.3 | 87 |
| Open Vocabulary Semantic Segmentation | Cityscapes | mIoU 55.7 | 81 |
| Open Vocabulary Semantic Segmentation | ADE20K | mIoU 29.1 | 80 |
| Open Vocabulary Semantic Segmentation | Pascal VOC 21 (With Background) | mIoU 73.2 | 39 |
| Open Vocabulary Semantic Segmentation | Pascal VOC 20 without background | mIoU 92.5 | 38 |

Showing 10 of 12 rows
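For readers unfamiliar with the metric, mIoU is the per-class intersection-over-union averaged over classes. Below is a minimal sketch on flattened label maps. The function name and toy labels are illustrative, and benchmark toolkits commonly accumulate intersections and unions over the whole dataset rather than per image, so this is not the exact evaluation protocol behind the numbers above.

```python
def mean_iou(pred, target, num_classes):
    """Mean IoU over classes that appear in the prediction or the ground truth."""
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union:  # skip classes absent from both maps
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0

# Toy flattened label maps with three classes.
pred = [0, 0, 1, 1, 2, 2]
target = [0, 1, 1, 1, 2, 0]
score = mean_iou(pred, target, num_classes=3)  # per-class IoUs: 1/3, 2/3, 1/2
```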
