CAT-Seg: Cost Aggregation for Open-Vocabulary Semantic Segmentation

About

Open-vocabulary semantic segmentation requires labeling each pixel in an image according to a wide range of text descriptions. In this work, we introduce a novel cost-based approach to adapt vision-language foundation models, notably CLIP, to semantic segmentation. By aggregating the cosine similarity scores between image and text embeddings, i.e., the cost volume, our method adapts CLIP for segmenting both seen and unseen classes by fine-tuning its encoders, addressing the difficulty existing methods have with unseen classes. Building on this, we explore how to aggregate the cost volume effectively given its multi-modal nature, as it is established between image and text embeddings. Furthermore, we examine various methods for efficiently fine-tuning CLIP.
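
To make the cost volume concrete, here is a minimal NumPy sketch of the cosine similarity between dense image embeddings and per-class text embeddings. It is an illustration under stated assumptions, not the authors' implementation: the function name `cost_volume`, the shapes `(H, W, D)` and `(N, D)`, and the random arrays standing in for CLIP outputs are all hypothetical. The final per-pixel argmax is only a naive baseline for reading labels off the raw cost volume; it does not reproduce the aggregation the paper explores.

```python
import numpy as np

def cost_volume(image_emb, text_emb, eps=1e-8):
    """Cosine similarity between each pixel embedding and each class embedding.

    image_emb: array of shape (H, W, D), one D-dim embedding per spatial location.
    text_emb:  array of shape (N, D), one D-dim embedding per class description.
    Returns an array of shape (H, W, N).
    """
    # L2-normalize along the embedding axis so the dot product is a cosine.
    img = image_emb / (np.linalg.norm(image_emb, axis=-1, keepdims=True) + eps)
    txt = text_emb / (np.linalg.norm(text_emb, axis=-1, keepdims=True) + eps)
    # Dot product of every pixel embedding with every class embedding.
    return np.einsum("hwd,nd->hwn", img, txt)

# Random stand-ins for CLIP image and text embeddings (assumption for the demo).
rng = np.random.default_rng(0)
image_emb = rng.standard_normal((4, 6, 16))  # H=4, W=6, D=16
text_emb = rng.standard_normal((3, 16))      # N=3 candidate classes

cost = cost_volume(image_emb, text_emb)

# Naive baseline: assign each pixel the class with the highest similarity.
labels = cost.argmax(axis=-1)
print(cost.shape, labels.shape)
```

Each slice `cost[:, :, n]` is a similarity map for class `n`, which is the per-class structure a cost-aggregation step would then refine.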

Seokju Cho, Heeseong Shin, Sunghwan Hong, Anurag Arnab, Paul Hongsuck Seo, Seungryong Kim • 2023

Related benchmarks

Task	Dataset	Result
Semantic segmentation	ADE20K	mIoU 31.8	1028
Semantic segmentation	PASCAL Context (val)	mIoU 62	360
Semantic segmentation	ADE20K A-150	mIoU 37.9	224
Semantic segmentation	Pascal Context 59	mIoU 63.3	204
Semantic segmentation	PC-59	mIoU 63.3	174
Semantic segmentation	Vaihingen	mIoU 42.3	156
Semantic segmentation	iSAID	mIoU 94.77	146
Medical Image Segmentation	BUSI	Dice Score 81.83	134
Semantic segmentation	Pascal VOC 20	mIoU 97	130
Semantic segmentation	PASCAL-Context 59 class (val)	mIoU 63.3	125

Showing 10 of 194 rows
