Effective SAM Combination for Open-Vocabulary Semantic Segmentation

About

Open-vocabulary semantic segmentation aims to assign pixel-level labels to images across an unlimited range of classes. Traditional methods address this by sequentially connecting a powerful mask proposal generator, such as the Segment Anything Model (SAM), with a pre-trained vision-language model like CLIP. However, these two-stage approaches often suffer from high computational costs and memory inefficiencies. In this paper, we propose ESC-Net, a novel one-stage open-vocabulary segmentation model that leverages the SAM decoder blocks for class-agnostic segmentation within an efficient inference framework. By embedding pseudo prompts generated from image-text correlations into SAM's promptable segmentation framework, ESC-Net achieves refined spatial aggregation for accurate mask predictions. ESC-Net achieves superior performance on standard benchmarks, including ADE20K, PASCAL-VOC, and PASCAL-Context, outperforming prior methods in both efficiency and accuracy. Comprehensive ablation studies further demonstrate its robustness across challenging conditions.
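
The abstract does not spell out how pseudo prompts are derived from image-text correlations. As a rough illustration only, and not ESC-Net's actual procedure, the sketch below assumes the correlation is a per-location cosine similarity between image features and a class text embedding, and that the highest-scoring locations serve as point prompts. The function names (`correlation_map`, `pseudo_point_prompts`) and the toy feature values are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def correlation_map(pixel_feats, text_feat):
    """Similarity of every spatial location to one class text embedding."""
    return [[cosine(f, text_feat) for f in row] for row in pixel_feats]

def pseudo_point_prompts(corr, k=1):
    """Assumed heuristic: take the k highest-correlation (row, col) locations."""
    scored = [(corr[r][c], r, c)
              for r in range(len(corr))
              for c in range(len(corr[0]))]
    scored.sort(key=lambda t: -t[0])  # highest similarity first
    return [(r, c) for _, r, c in scored[:k]]

# Toy 2x2 feature grid with 2-D embeddings (made-up values).
pixel_feats = [[[1.0, 0.0], [0.0, 1.0]],
               [[1.0, 1.0], [-1.0, 0.0]]]
text_feat = [1.0, 0.0]

corr = correlation_map(pixel_feats, text_feat)
points = pseudo_point_prompts(corr, k=2)
print(points)
```

In a real pipeline the features would come from a vision-language model such as CLIP, and the selected points would be passed to a promptable decoder; both steps are omitted here.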

Minhyeok Lee, Suhwan Cho, Jungho Lee, Sunghun Yang, Heeseung Choi, Ig-Jae Kim, Sangyoun Lee • 2024

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Semantic segmentation | ADE20K | mIoU 35.6 | 936 |
| Semantic segmentation | Pascal Context 59 | mIoU 59 | 164 |
| Semantic segmentation | Pascal VOC 20 | mIoU 97.3 | 105 |
| Semantic segmentation | Pascal VOC 21 classes (val) | mIoU 80.1 | 103 |
| Open Vocabulary Semantic Segmentation | Pascal VOC 20 | mIoU 98.3 | 62 |
| Open Vocabulary Semantic Segmentation | ADE-847 | mIoU 18.1 | 59 |
| Open Vocabulary Semantic Segmentation | Pascal Context PC-59 | mIoU 65.6 | 57 |
| Open Vocabulary Semantic Segmentation | ADE20K A-150 | mIoU 41.8 | 54 |
| Semantic segmentation | Dv 19-class (val) | ACDC-19 Score 46.3 | 46 |
| Semantic segmentation | Dv 58-class (val) | ACDC-4157.1 | 46 |
Showing 10 of 15 rows
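
For readers unfamiliar with the mIoU metric reported in the table, the following is a minimal sketch of how it can be computed on flattened label arrays. It assumes that classes absent from both prediction and ground truth are skipped; benchmark toolkits typically accumulate intersections and unions over the whole dataset and report the result as a percentage, so this single-sample version is illustrative only. The function name `mean_iou` and the sample labels are hypothetical.

```python
def mean_iou(pred, target, num_classes):
    """Mean intersection-over-union across classes present in pred or target."""
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union:  # skip classes that appear in neither array
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0

# Six pixels, three classes (made-up labels).
pred   = [0, 0, 1, 1, 2, 2]
target = [0, 1, 1, 1, 2, 0]

score = mean_iou(pred, target, num_classes=3)
print(f"mIoU = {100 * score:.1f}")
```

Here the per-class IoUs are 1/3, 2/3, and 1/2, so the mean is 0.5.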
