Emergent Open-Vocabulary Semantic Segmentation from Off-the-shelf Vision-Language Models

About

From image-text pairs, large-scale vision-language models (VLMs) learn to implicitly associate image regions with words, which proves effective for tasks like visual question answering. However, leveraging the learned association for open-vocabulary semantic segmentation remains a challenge. In this paper, we propose a simple, yet extremely effective, training-free technique, Plug-and-Play Open-Vocabulary Semantic Segmentation (PnP-OVSS), for this task. PnP-OVSS leverages a VLM with direct text-to-image cross-attention and an image-text matching loss. To balance between over-segmentation and under-segmentation, we introduce Salience Dropout: by iteratively dropping the patches that the model is most attentive to, we are able to better resolve the entire extent of the segmentation mask. PnP-OVSS does not require any neural network training and performs hyperparameter tuning without the need for any segmentation annotations, even for a validation set. PnP-OVSS demonstrates substantial improvements over comparable baselines (+26.2% mIoU on Pascal VOC, +20.5% mIoU on MS COCO, +3.1% mIoU on COCO Stuff, and +3.0% mIoU on ADE20K). Our codebase is at https://github.com/letitiabanana/PnP-OVSS.
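The Salience Dropout idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the `attend` callback, the fixed drop fraction, the number of rounds, and the choice to sum attention maps across rounds are all assumptions made for illustration, and `toy_attend` is a hypothetical stand-in for a VLM's text-to-image cross-attention.

```python
import numpy as np

def salience_dropout(attend, num_patches, rounds=3, drop_frac=0.25):
    """Accumulate per-patch attention while repeatedly dropping the top patches.

    attend(keep) is a hypothetical callback: given a boolean keep-mask over
    patches, it returns one attention score per patch.
    """
    keep = np.ones(num_patches, dtype=bool)
    total = np.zeros(num_patches, dtype=float)
    for _ in range(rounds):
        # Attention over the patches that are still visible.
        attn = np.where(keep, attend(keep), 0.0)
        # Assumption: aggregate rounds by summation.
        total += attn
        k = int(drop_frac * keep.sum())
        if k == 0:
            break
        # Rank only the kept patches; dropped ones can never be re-selected.
        ranked = np.where(keep, attn, -np.inf)
        top = np.argsort(ranked)[::-1][:k]
        keep[top] = False
    return total

def toy_attend(keep):
    """Toy attention: fixed salience per patch, renormalized over kept patches."""
    base = np.array([0.9, 0.8, 0.3, 0.2, 0.05, 0.05, 0.0, 0.0])
    scores = np.where(keep, base, 0.0)
    s = scores.sum()
    return scores / s if s > 0 else scores

single_pass = toy_attend(np.ones(8, dtype=bool))
accumulated = salience_dropout(toy_attend, num_patches=8)
```

In this toy setting, patches that were overshadowed by the two most salient ones in a single pass receive more accumulated attention once those dominant patches are dropped, which is the intuition behind recovering the full extent of an object.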

Jiayun Luo, Siddhesh Khandelwal, Leonid Sigal, Boyang Li • 2023

Related benchmarks

Task	Dataset	Result
Semantic segmentation	ADE20K (val)	mIoU 14.2	3069
Semantic segmentation	ADE20K	mIoU 14.2	559
Semantic segmentation	COCO Stuff	mIoU 17.9	399
Semantic segmentation	PC-59	mIoU 28	174
Semantic segmentation	COCO Stuff (val)	mIoU 17.9	167
Semantic segmentation	COCO Object	mIoU 36.2	139
Semantic segmentation	PASCAL-Context 59 class (val)	mIoU 28	125
Semantic segmentation	VOC-20	mIoU 51.3	118
Semantic segmentation	COCO Object (val)	mIoU 36.2	101
Open Vocabulary Semantic Segmentation	COCO Stuff without background	mIoU 17.9	90

Showing 10 of 15 rows
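
For readers unfamiliar with the metric in the table, mean intersection-over-union (mIoU) can be sketched as follows. This is a simplified per-image version under stated assumptions: label maps are integer arrays, classes absent from both prediction and ground truth are skipped, and `mean_iou` is a hypothetical helper name. Benchmark evaluations typically accumulate intersections and unions over the whole dataset before averaging, so this sketch will not reproduce the reported numbers.

```python
import numpy as np

def mean_iou(pred, target, num_classes):
    """Average per-class IoU between two integer label maps of equal shape."""
    ious = []
    for c in range(num_classes):
        p = pred == c
        t = target == c
        union = np.logical_or(p, t).sum()
        if union == 0:
            continue  # class absent from both maps; skip it
        ious.append(np.logical_and(p, t).sum() / union)
    return float(np.mean(ious)) if ious else float("nan")

pred = np.array([[0, 0], [1, 1]])
target = np.array([[0, 1], [1, 1]])
score = mean_iou(pred, target, num_classes=2)
```

Here class 0 has IoU 1/2 and class 1 has IoU 2/3, so the mean is 7/12.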
