Harnessing Vision Foundation Models for High-Performance, Training-Free Open Vocabulary Segmentation

About

Although Contrastive Language-Image Pre-training (CLIP) has advanced open-vocabulary predictions, its performance on semantic segmentation remains suboptimal. This shortfall stems primarily from its spatially invariant semantic features and constrained resolution. Previous adaptations addressed spatial invariance by modifying the self-attention in CLIP's image encoder, but the issue of limited resolution remains unexplored. Unlike previous segment-then-splice methods, which segment sub-images via a sliding window and splice the results, we introduce a splice-then-segment paradigm that incorporates the Segment Anything Model (SAM) to tackle the resolution issue, since SAM excels at extracting fine-grained semantic correlations from high-resolution images. Specifically, we introduce Trident, a training-free framework that first splices features extracted by CLIP and DINO from sub-images, then leverages SAM's encoder to create a correlation matrix for global aggregation, enabling a broadened receptive field for effective segmentation. In addition, we propose a refinement strategy for CLIP's coarse segmentation outputs that transforms them into prompts for SAM, further enhancing segmentation performance. Trident improves mIoU across eight benchmarks from 44.4 for the current SOTA to 48.6. Code is available at https://github.com/YuHengsss/Trident.

Yuheng Shi, Minjing Dong, Chang Xu • 2024
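
The splice-then-segment idea described in the abstract can be illustrated with a small NumPy sketch. It is not the Trident implementation: the function names are hypothetical, the sub-images are assumed to be non-overlapping tiles on a regular grid, the `guide` array merely stands in for SAM encoder features, and the softmax over cosine similarity is an assumed way to build a correlation matrix rather than the authors' construction.

```python
import numpy as np

def splice_features(tiles, grid_rows, grid_cols):
    """Splice per-tile feature maps (each h x w x c, row-major tile order)
    into one (grid_rows*h) x (grid_cols*w) x c feature map."""
    rows = [
        np.concatenate(tiles[r * grid_cols:(r + 1) * grid_cols], axis=1)
        for r in range(grid_rows)
    ]
    return np.concatenate(rows, axis=0)

def correlation_matrix(guide, temperature=0.1):
    """Row-normalised affinity between all positions.
    guide: (N, d) features, a stand-in for SAM encoder output (assumption)."""
    g = guide / (np.linalg.norm(guide, axis=1, keepdims=True) + 1e-8)
    sim = g @ g.T / temperature
    sim = sim - sim.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(sim)
    return weights / weights.sum(axis=1, keepdims=True)

def splice_then_segment(tiles, guide, text_emb, grid_rows, grid_cols):
    """Splice tile features, aggregate them globally, then label each position.
    text_emb: (K, c) class embeddings; returns an (H, W) map of class indices."""
    feats = splice_features(tiles, grid_rows, grid_cols)
    height, width, channels = feats.shape
    flat = feats.reshape(height * width, channels)
    aggregated = correlation_matrix(guide) @ flat  # every position sees the whole image
    logits = aggregated @ text_emb.T
    return logits.argmax(axis=1).reshape(height, width)
```

Because aggregation happens after splicing, each position can draw on features from every tile, which is the broadened receptive field the abstract refers to; a segment-then-splice pipeline would instead label each tile in isolation.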

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Semantic segmentation | ADE20K | mIoU 26.7 | 936 |
| Semantic segmentation | Cityscapes | mIoU 47.6 | 578 |
| Semantic segmentation | COCO Stuff | mIoU 0.286 | 195 |
| Semantic segmentation | Pascal Context 59 | mIoU 44.3 | 164 |
| Semantic segmentation | Pascal VOC 20 | mIoU 88.7 | 105 |
| Semantic segmentation | Pascal VOC 21 classes (val) | mIoU 70.8 | 103 |
| Semantic segmentation | Pascal Context 60 | mIoU 40.1 | 81 |
| Semantic segmentation | COCO Object | mIoU 42.2 | 73 |
| Open Vocabulary Semantic Segmentation | COCOStuff (val) | mIoU 28.6 | 60 |
| Open-Vocabulary Segmentation | Cityscapes | mIoU 47.6 | 49 |
Showing 10 of 30 rows
