VIP: Visual-guided Prompt Evolution for Efficient Dense Vision-Language Inference
About
Training-free open-vocabulary semantic segmentation that is both efficient and generalizable remains challenging because of the deep-seated spatial bias in CLIP. To overcome the limitations of existing solutions, this work moves beyond the CLIP-based paradigm and builds on the recent spatially-aware dino.txt framework to enable more efficient, higher-quality dense prediction. Although dino.txt exhibits robust spatial awareness, we find that the semantic ambiguity of text queries causes severe mismatches in its dense cross-modal interactions. To address this, we introduce Visual-guided Prompt evolution (VIP), which rectifies the semantic expressiveness of text queries in dino.txt and unlocks its potential for fine-grained object perception. To this end, VIP combines alias expansion with a visual-guided distillation mechanism to mine valuable semantic cues, which are then robustly aggregated in a saliency-aware manner to yield a high-fidelity prediction. Extensive evaluations demonstrate that VIP (1) surpasses leading methods by 1.4%-8.4% average mIoU, (2) generalizes well to diverse challenging domains, and (3) adds only marginal inference time and memory overhead.
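To make the idea of dense cross-modal matching with expanded aliases concrete, the following is a minimal sketch and not VIP's actual pipeline. It assumes per-patch visual features and per-alias text embeddings are already available as plain vectors, and it substitutes a simple maximum over aliases for the saliency-aware aggregation and visual-guided distillation named in the abstract, whose details are not given here. The function names `cosine` and `segment_with_aliases` and the toy vectors are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def segment_with_aliases(patch_feats, alias_embeds):
    """Assign each patch the class whose best-matching alias is most similar.

    patch_feats:  list of patch feature vectors (assumed given).
    alias_embeds: dict mapping class name -> list of alias text embeddings.
    NOTE: max-over-aliases is an illustrative stand-in, not the paper's
    saliency-aware aggregation.
    """
    labels = []
    for feat in patch_feats:
        best_class, best_score = None, -math.inf
        for cls, embeds in alias_embeds.items():
            # A class scores as well as its closest alias does.
            score = max(cosine(feat, e) for e in embeds)
            if score > best_score:
                best_class, best_score = cls, score
        labels.append(best_class)
    return labels


# Toy example: two patches, two classes, one class with two aliases.
toy_aliases = {"sofa": [[1.0, 0.0], [0.9, 0.1]], "grass": [[0.0, 1.0]]}
toy_patches = [[0.8, 0.2], [0.1, 0.9]]
toy_labels = segment_with_aliases(toy_patches, toy_aliases)
```

In this toy setup a patch is matched to a class through whichever alias embedding lies closest to it, which illustrates why a richer set of text queries can reduce mismatches caused by a single ambiguous class name.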
Related benchmarks
| Task | Dataset | mIoU | Rank |
|---|---|---|---|
| Semantic segmentation | Vaihingen | 47 | 156 |
| Semantic segmentation | iSAID | 26.1 | 146 |
| Semantic segmentation | Potsdam | 49.8 | 110 |
| Open Vocabulary Semantic Segmentation | COCO Stuff without background | 52.4 | 90 |
| Open Vocabulary Semantic Segmentation | COCO Object with background | 52.4 | 87 |
| Semantic segmentation | VDD | 54.3 | 87 |
| Open Vocabulary Semantic Segmentation | Cityscapes | 55.7 | 81 |
| Open Vocabulary Semantic Segmentation | ADE20K | 29.1 | 80 |
| Open Vocabulary Semantic Segmentation | Pascal VOC 21 (With Background) | 73.2 | 39 |
| Open Vocabulary Semantic Segmentation | Pascal VOC 20 without background | 92.5 | 38 |
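
The results in this table are reported as mIoU (mean intersection-over-union). As a reminder of what the metric measures, here is a minimal sketch over flattened label lists. The function name `mean_iou` and the `ignore_index=255` default are assumptions for illustration, and predictions are assumed to be valid class indices. Benchmark toolkits typically accumulate the per-class counts over an entire dataset before averaging, which this sketch also supports if all pixels are concatenated first.

```python
def mean_iou(pred, target, num_classes, ignore_index=255):
    """Mean IoU over classes that appear in either pred or target.

    pred, target: flat sequences of integer class labels of equal length.
    Pixels whose target equals ignore_index are skipped (an assumed convention).
    """
    intersection = [0] * num_classes
    union = [0] * num_classes
    for p, t in zip(pred, target):
        if t == ignore_index:
            continue
        if p == t:
            # Correct pixel: counts toward both intersection and union.
            intersection[t] += 1
            union[t] += 1
        else:
            # Wrong pixel: enlarges the union of both classes involved.
            union[p] += 1
            union[t] += 1
    ious = [i / u for i, u in zip(intersection, union) if u > 0]
    return sum(ious) / len(ious) if ious else float("nan")


# Toy example: class 0 has IoU 1/2, class 1 has IoU 2/3.
toy_score = mean_iou([0, 0, 1, 1], [0, 1, 1, 1], num_classes=2)
```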