DPSeg: Dual-Prompt Cost Volume Learning for Open-Vocabulary Semantic Segmentation

About

Open-vocabulary semantic segmentation aims to segment images into distinct semantic regions for both seen and unseen categories at the pixel level. Current methods utilize text embeddings from pre-trained vision-language models like CLIP but struggle with the inherent domain gap between image and text embeddings, even after extensive alignment during training. Additionally, relying solely on deep text-aligned features limits shallow-level feature guidance, which is crucial for detecting small objects and fine details, ultimately reducing segmentation accuracy. To address these limitations, we propose a dual prompting framework, DPSeg, for this task. Our approach combines dual-prompt cost volume generation, a cost volume-guided decoder, and a semantic-guided prompt refinement strategy that leverages our dual prompting scheme to mitigate alignment issues in visual prompt generation. By incorporating visual embeddings from a visual prompt encoder, our approach reduces the domain gap between text and image embeddings while providing multi-level guidance through shallow features. Extensive experiments demonstrate that our method significantly outperforms existing state-of-the-art approaches on multiple public datasets.
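As a rough illustration of the cost volume idea mentioned above, the sketch below computes a cosine-similarity volume between per-pixel image embeddings and per-class prompt embeddings, once for text prompts and once for visual prompts. This is not the authors' implementation. The random arrays stand in for CLIP and visual prompt encoder outputs, the names `cost_volume`, `text_embeds`, and `visual_embeds` are hypothetical, and stacking the two volumes is only one assumed way of combining them.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors along `axis` to (approximately) unit length."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def cost_volume(pixel_feats, prompt_embeds):
    """Cosine similarity between every pixel and every class prompt.

    pixel_feats:   (H, W, D) dense image embeddings
    prompt_embeds: (C, D) one embedding per class
    returns:       (H, W, C) similarity scores in [-1, 1]
    """
    p = l2_normalize(pixel_feats)
    q = l2_normalize(prompt_embeds)
    return np.einsum("hwd,cd->hwc", p, q)

# Random placeholders (assumption): real inputs would come from pre-trained encoders.
rng = np.random.default_rng(0)
H, W, D, C = 4, 4, 8, 3
image_feats = rng.normal(size=(H, W, D))
text_embeds = rng.normal(size=(C, D))    # stand-in for text prompt embeddings
visual_embeds = rng.normal(size=(C, D))  # stand-in for visual prompt embeddings

text_cost = cost_volume(image_feats, text_embeds)
visual_cost = cost_volume(image_feats, visual_embeds)

# Assumption: keep both volumes side by side so a decoder could consume them jointly.
dual_cost = np.stack([text_cost, visual_cost], axis=0)  # shape (2, H, W, C)

# Naive per-pixel prediction from the text volume alone, for comparison.
text_only_labels = text_cost.argmax(axis=-1)            # shape (H, W)
```

In this sketch a downstream decoder would take `dual_cost` as input; the argmax over the text volume shows only the simplest possible baseline.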

Ziyu Zhao, Xiaoguang Li, Linjia Shi, Nasrin Imanpour, Song Wang • 2025

Related benchmarks

Task	Dataset	Result	
Semantic segmentation	ADE20K A-150	mIoU 36.4	224
Semantic segmentation	Pascal Context 59	mIoU 62	204
Semantic segmentation	PC-59	mIoU 62.3	174
Semantic segmentation	PASCAL-Context 59 class (val)	mIoU 62	125
Semantic segmentation	ADE20K 847	mIoU 1.49e+3	105
Semantic segmentation	PC-459	mIoU 24.1	94
Semantic segmentation	Pascal Context 459	mIoU 23.5	82
Open Vocabulary Semantic Segmentation	ADE20K A-150	mIoU 37.1	79
Semantic segmentation	PASCAL-Context 59 classes (test)	mIoU 62.3	75
Semantic segmentation	PASCAL-Context PC-459	mIoU 24.1	69

Showing 10 of 24 rows
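The results above are reported as mIoU (mean intersection over union), usually given as a percentage. The following is a minimal sketch of the metric on a single pair of label maps, assuming integer class IDs. The function name `mean_iou` and the toy arrays are illustrative, and benchmark implementations typically accumulate intersections and unions over the whole dataset rather than averaging per image.

```python
import numpy as np

def mean_iou(pred, target, num_classes):
    """Mean intersection-over-union across classes present in pred or target."""
    ious = []
    for c in range(num_classes):
        pred_c = pred == c
        target_c = target == c
        union = np.logical_or(pred_c, target_c).sum()
        if union == 0:
            continue  # class absent from both maps; leave it out of the mean
        inter = np.logical_and(pred_c, target_c).sum()
        ious.append(inter / union)
    return float(np.mean(ious)) if ious else 0.0

# Toy 2x2 label maps with two classes.
pred = np.array([[0, 0], [1, 1]])
target = np.array([[0, 1], [1, 1]])

# Class 0: intersection 1, union 2. Class 1: intersection 2, union 3.
score = mean_iou(pred, target, num_classes=2)
print(f"mIoU = {100 * score:.2f}")  # prints "mIoU = 58.33"
```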
