Harnessing Vision Foundation Models for High-Performance, Training-Free Open Vocabulary Segmentation

About

While Contrastive Language-Image Pre-training (CLIP) has advanced open-vocabulary predictions, its performance on semantic segmentation remains suboptimal. This shortfall primarily stems from its spatial-invariant semantic features and constrained resolution. While previous adaptations addressed spatial invariance by modifying the self-attention in CLIP's image encoder, the issue of limited resolution remains unexplored. Unlike previous segment-then-splice methods that segment sub-images via a sliding window and splice the results, we introduce a splice-then-segment paradigm that incorporates the Segment Anything Model (SAM) to tackle the resolution issue, since SAM excels at extracting fine-grained semantic correlations from high-resolution images. Specifically, we introduce Trident, a training-free framework that first splices features extracted by CLIP and DINO from sub-images, then leverages SAM's encoder to create a correlation matrix for global aggregation, enabling a broadened receptive field for effective segmentation. In addition, we propose a refinement strategy for CLIP's coarse segmentation outputs by transforming them into prompts for SAM, further enhancing the segmentation performance. Trident achieves a significant improvement in mIoU across eight benchmarks compared with the current SOTA, increasing from 44.4 to 48.6. Code is available at https://github.com/YuHengsss/Trident.
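The splice-then-segment idea described above can be illustrated with a toy sketch. This is not the Trident implementation: the function names (`splice_windows`, `aggregate`, `classify`) are hypothetical, the features and affinity matrix are hand-written placeholders standing in for CLIP/DINO features and a SAM-derived correlation matrix, and the softmax-weighted average is an assumed form of the global aggregation. Windows are also assumed to be non-overlapping.

```python
import math


def splice_windows(window_feats, grid_rows, grid_cols):
    """Splice per-window feature grids into one global grid.

    window_feats[r][c] is an h x w grid of feature vectors for the
    sub-image at grid position (r, c). Non-overlapping windows assumed.
    """
    global_rows = []
    for r in range(grid_rows):
        height = len(window_feats[r][0])
        for i in range(height):
            row = []
            for c in range(grid_cols):
                row.extend(window_feats[r][c][i])
            global_rows.append(row)
    return global_rows


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate(tokens, affinity):
    """Replace each token by an affinity-weighted average of all tokens.

    affinity[i][j] scores how related token j is to token i (assumed to
    come from a separate encoder; here it is supplied by hand).
    """
    dim = len(tokens[0])
    out = []
    for row in affinity:
        weights = softmax(row)
        out.append([sum(w * t[d] for w, t in zip(weights, tokens))
                    for d in range(dim)])
    return out


def classify(tokens, text_embeds):
    """Label each token with the class of the most cosine-similar text embedding."""
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        return dot / (norm + 1e-12)

    return [max(range(len(text_embeds)), key=lambda k: cosine(t, text_embeds[k]))
            for t in tokens]


# Toy usage: a 1 x 2 grid of windows, each holding a single 2-D feature.
window_feats = [[[[[1.0, 0.0]]], [[[0.0, 1.0]]]]]
global_grid = splice_windows(window_feats, 1, 2)
tokens = [vec for row in global_grid for vec in row]
affinity = [[10.0, -10.0], [-10.0, 10.0]]  # placeholder correlation matrix
labels = classify(aggregate(tokens, affinity), [[1.0, 0.0], [0.0, 1.0]])
```

The point of the ordering is that aggregation runs over the spliced global token set, so a token in one window can draw on tokens from every other window, which per-window segmentation followed by splicing cannot do.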

Yuheng Shi, Minjing Dong, Chang Xu • 2024

Related benchmarks

Task	Dataset	Result
Semantic segmentation	ADE20K	mIoU 26.7	1028
Semantic segmentation	Cityscapes	mIoU 47.6	668
Semantic segmentation	ADE20K	mIoU 27.02	559
Semantic segmentation	Cityscapes	mIoU 47.6	494
Semantic segmentation	COCO Stuff	mIoU 28.55	399
Semantic segmentation	Pascal Context	mIoU 44.32	217
Semantic segmentation	Pascal Context 59	mIoU 44.3	204
Semantic segmentation	PC-59	mIoU 46.1	174
Semantic segmentation	Pascal VOC	mIoU 88.67	159
Semantic segmentation	Vaihingen	mIoU 27.7	156

Showing 10 of 75 rows
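For readers unfamiliar with the metric in the table, the following is a minimal sketch of mean intersection-over-union (mIoU) on flat label lists. The function name `mean_iou` is illustrative, and the sketch scores a single prediction; benchmark protocols typically accumulate intersections and unions over the whole dataset before averaging, and may handle ignored labels differently.

```python
def mean_iou(pred, target, num_classes):
    """Return mIoU in percent, averaged over classes present in pred or target."""
    ious = []
    for k in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == k and t == k)
        union = sum(1 for p, t in zip(pred, target) if p == k or t == k)
        if union > 0:  # skip classes absent from both prediction and target
            ious.append(inter / union)
    if not ious:
        return 0.0
    return 100.0 * sum(ious) / len(ious)


# Toy usage: four pixels, two classes.
score = mean_iou([0, 0, 1, 1], [0, 1, 1, 1], num_classes=2)
```

In this toy case class 0 has IoU 1/2 and class 1 has IoU 2/3, so the mean is 7/12, or about 58.3 on the percentage scale used in the table.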
