OTSeg: Multi-prompt Sinkhorn Attention for Zero-Shot Semantic Segmentation

About

The recent success of CLIP has demonstrated promising results in zero-shot semantic segmentation by transferring multimodal knowledge to pixel-level classification. However, existing approaches remain limited in how closely they can align text embeddings with pixel embeddings using pre-trained CLIP knowledge. To address this issue, we propose OTSeg, a novel multimodal attention mechanism aimed at enhancing the potential of multiple text prompts for matching associated pixel embeddings. We first propose Multi-Prompts Sinkhorn (MPS), based on the Optimal Transport (OT) algorithm, which leads multiple text prompts to selectively focus on various semantic features within image pixels. Moreover, inspired by the success of Sinkformers in unimodal settings, we introduce an extension of MPS, called Multi-Prompts Sinkhorn Attention (MPSA), which effectively replaces cross-attention mechanisms within the Transformer framework in multimodal settings. Through extensive experiments, we demonstrate that OTSeg achieves state-of-the-art (SOTA) performance with significant gains on Zero-Shot Semantic Segmentation (ZS3) tasks across three benchmark datasets.
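The abstract names the Sinkhorn algorithm as the basis of MPS without giving details. As a rough illustration only, the sketch below runs generic entropy-regularized Sinkhorn iterations between a few prompt embeddings and a set of pixel embeddings. It is not the authors' implementation: the function name `sinkhorn_plan`, the cosine-distance cost, the uniform marginals, the regularization strength `eps`, and the random embeddings are all assumptions made for the example.

```python
import numpy as np

def sinkhorn_plan(cost, eps=0.1, n_iters=200):
    """Generic entropy-regularized OT plan between rows (prompts) and columns (pixels).

    Assumption: both marginals are uniform. The paper may use a different setup.
    """
    n, m = cost.shape
    a = np.full(n, 1.0 / n)          # target mass per prompt (assumed uniform)
    b = np.full(m, 1.0 / m)          # target mass per pixel (assumed uniform)
    K = np.exp(-cost / eps)          # Gibbs kernel of the cost matrix
    u = np.ones(n)
    v = np.ones(m)
    for _ in range(n_iters):
        u = a / (K @ v)              # rescale rows toward marginal a
        v = b / (K.T @ u)            # rescale columns toward marginal b
    return u[:, None] * K * v[None, :]

# Hypothetical data: 4 prompt embeddings and 64 pixel embeddings of dimension 16.
rng = np.random.default_rng(0)
prompts = rng.normal(size=(4, 16))
pixels = rng.normal(size=(64, 16))
prompts /= np.linalg.norm(prompts, axis=1, keepdims=True)
pixels /= np.linalg.norm(pixels, axis=1, keepdims=True)

cost = 1.0 - prompts @ pixels.T      # cosine distance, assumed as the OT cost
plan = sinkhorn_plan(cost)           # shape (4, 64): mass each prompt sends to each pixel
```

Each row of `plan` can be read as a soft assignment of one prompt over the pixels. Because the column marginals force every pixel to receive the same total mass, the prompts have to share the pixels between them, which is one way to read the abstract's statement that different prompts focus on different semantic features.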

Kwanyoung Kim, Yujin Oh, Jong Chul Ye • 2024

Related benchmarks

Task	Dataset	Result
Semantic segmentation	ADE20K (val)	mIoU 21.9	3069
Semantic segmentation	PASCAL VOC 2012 (val)	Mean IoU 94.4	2204
Semantic segmentation	PASCAL Context (val)	mIoU 53.4	360
Semantic segmentation	Pascal Context (test)	--	223
Semantic segmentation	PASCAL-Context 59 class (val)	mIoU 53.4	125
Semantic segmentation	COCO-Stuff 164K (test)	--	77
Semantic segmentation	COCOStuff 164k (val)	mIoU 18.9	47
Semantic segmentation	VOC (val)	mIoU 94.4	25
Semantic segmentation	VOC 2012	mIoU (Smoothed) 94.3	23
Semantic segmentation	Efficiency benchmark NVIDIA 3090 GPU	GFLOPS 61.9	5

