Effective SAM Combination for Open-Vocabulary Semantic Segmentation

About

Open-vocabulary semantic segmentation aims to assign pixel-level labels to images across an unlimited range of classes. Traditional methods address this by sequentially connecting a powerful mask proposal generator, such as the Segment Anything Model (SAM), with a pre-trained vision-language model like CLIP. However, these two-stage approaches often suffer from high computational costs and memory inefficiencies. In this paper, we propose ESC-Net, a novel one-stage open-vocabulary segmentation model that leverages the SAM decoder blocks for class-agnostic segmentation within an efficient inference framework. By embedding pseudo prompts generated from image-text correlations into SAM's promptable segmentation framework, ESC-Net achieves refined spatial aggregation for accurate mask predictions. ESC-Net achieves superior performance on standard benchmarks, including ADE20K, PASCAL-VOC, and PASCAL-Context, outperforming prior methods in both efficiency and accuracy. Comprehensive ablation studies further demonstrate its robustness across challenging conditions.

Minhyeok Lee, Suhwan Cho, Jungho Lee, Sunghun Yang, Heeseung Choi, Ig-Jae Kim, Sangyoun Lee • 2024
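
The abstract mentions pseudo prompts generated from image-text correlations. As a rough illustration of that idea only, the sketch below computes a cosine-similarity map between a grid of image features and one text embedding, then takes the highest-scoring cells as point-style prompts. This is not the ESC-Net implementation: the function names (`cosine`, `correlation_map`, `pseudo_point_prompts`), the top-k selection rule, and the toy feature values are all assumptions made for this example.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def correlation_map(image_feats, text_embed):
    """image_feats is an H x W grid of D-dim vectors; returns an H x W grid of similarities."""
    return [[cosine(f, text_embed) for f in row] for row in image_feats]

def pseudo_point_prompts(corr, k=1):
    """Return the (row, col) positions of the k highest-correlation cells.

    Hypothetical selection rule for illustration; the paper may derive its
    prompts differently.
    """
    cells = [(score, r, c) for r, row in enumerate(corr) for c, score in enumerate(row)]
    cells.sort(key=lambda t: t[0], reverse=True)
    return [(r, c) for _, r, c in cells[:k]]

# Toy 2x2 feature grid with 2-dim embeddings (made-up values).
feats = [[[1.0, 0.0], [0.0, 1.0]],
         [[1.0, 1.0], [-1.0, 0.0]]]
text = [1.0, 0.0]

corr = correlation_map(feats, text)
points = pseudo_point_prompts(corr, k=2)
print(points)
```

In a real pipeline the features would come from a vision-language encoder such as CLIP, and the selected locations would be passed to a promptable decoder; both steps are outside the scope of this sketch.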

Related benchmarks

Task	Dataset	Result
Semantic segmentation	ADE20K	mIoU 35.6	1028
Semantic segmentation	Pascal Context 59	mIoU 59	204
Semantic segmentation	PC-59	mIoU 65.6	174
Semantic segmentation	Pascal VOC 20	mIoU 97.3	130
Open Vocabulary Semantic Segmentation	Pascal VOC 20	mIoU 98.3	113
Semantic segmentation	Pascal VOC 21 classes (val)	mIoU 80.1	103
Open Vocabulary Semantic Segmentation	Pascal Context PC-59	mIoU 65.6	99
Semantic segmentation	PC-459	mIoU 27	94
Open Vocabulary Semantic Segmentation	ADE20K A-150	mIoU 41.8	79
Semantic segmentation	A-150	mIoU 41.8	67

Showing 10 of 24 rows
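
The results above are reported as mIoU (mean intersection over union). For readers unfamiliar with the metric, here is a minimal sketch of how it can be computed for a single pair of label maps. The function name `mean_iou` and the toy inputs are assumptions for illustration; benchmark protocols typically accumulate counts over a whole dataset and may handle ignored labels differently.

```python
def mean_iou(pred, target, num_classes):
    """Mean IoU over classes that appear in either the prediction or the target.

    pred and target are equal-shaped 2D grids of integer class labels.
    """
    inter = [0] * num_classes
    union = [0] * num_classes
    for p_row, t_row in zip(pred, target):
        for p, t in zip(p_row, t_row):
            if p == t:
                # Pixel counts toward both intersection and union of its class.
                inter[p] += 1
                union[p] += 1
            else:
                # Mismatch: pixel enlarges the union of both classes involved.
                union[p] += 1
                union[t] += 1
    ious = [i / u for i, u in zip(inter, union) if u > 0]
    return sum(ious) / len(ious) if ious else 0.0

# Toy 2x2 label maps with two classes (made-up values).
pred = [[0, 0],
        [1, 1]]
target = [[0, 1],
          [1, 1]]

miou = mean_iou(pred, target, num_classes=2)
print(round(100 * miou, 1))  # class 0 IoU = 1/2, class 1 IoU = 2/3, mean = 58.3
```

Multiplying by 100 gives the percentage form in which the table's values appear to be reported.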
