
Emergent Open-Vocabulary Semantic Segmentation from Off-the-shelf Vision-Language Models

About

From image-text pairs, large-scale vision-language models (VLMs) learn to implicitly associate image regions with words, an ability that proves effective for tasks like visual question answering. However, leveraging the learned association for open-vocabulary semantic segmentation remains a challenge. In this paper, we propose a simple yet extremely effective training-free technique, Plug-and-Play Open-Vocabulary Semantic Segmentation (PnP-OVSS), for this task. PnP-OVSS leverages a VLM with direct text-to-image cross-attention and an image-text matching loss. To balance over-segmentation against under-segmentation, we introduce Salience Dropout: by iteratively dropping the patches that the model attends to most, we can better resolve the full extent of the segmentation mask. PnP-OVSS requires no neural network training and performs hyperparameter tuning without any segmentation annotations, even for a validation set. PnP-OVSS demonstrates substantial improvements over comparable baselines (+26.2% mIoU on Pascal VOC, +20.5% mIoU on MS COCO, +3.1% mIoU on COCO Stuff, and +3.0% mIoU on ADE20K). Our codebase is at https://github.com/letitiabanana/PnP-OVSS.

Jiayun Luo, Siddhesh Khandelwal, Leonid Sigal, Boyang Li • 2023
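
The abstract describes Salience Dropout only at a high level, so the following is a minimal sketch of the idea rather than the authors' implementation. It assumes a hypothetical `toy_attention` function standing in for the VLM's text-to-image cross-attention, and the drop fraction, number of rounds, threshold, and the choice to sum attention across rounds are all illustrative assumptions, not values or design decisions taken from the paper.

```python
from typing import Callable, List


def toy_attention(relevance: List[float], visible: List[bool]) -> List[float]:
    """Hypothetical stand-in for a VLM's text-to-image cross-attention.

    Cubing sharpens the scores so that a few highly discriminative patches
    dominate; the result is normalized over the patches still visible.
    """
    weights = [r ** 3 if v else 0.0 for r, v in zip(relevance, visible)]
    total = sum(weights)
    return [w / total if total > 0 else 0.0 for w in weights]


def salience_dropout(
    attend: Callable[[List[bool]], List[float]],
    num_patches: int,
    rounds: int = 2,            # assumed value, for illustration only
    drop_fraction: float = 0.25,  # assumed value, for illustration only
) -> List[float]:
    """Accumulate attention over several rounds, hiding the most attended
    visible patches after each round so weaker patches can surface."""
    visible = [True] * num_patches
    accumulated = [0.0] * num_patches
    for _ in range(rounds):
        attention = attend(visible)
        accumulated = [a + b for a, b in zip(accumulated, attention)]
        candidates = [i for i in range(num_patches) if visible[i]]
        num_drop = int(len(candidates) * drop_fraction)
        if num_drop == 0:
            break
        # Hide the most attended patches before the next round.
        candidates.sort(key=lambda i: attention[i], reverse=True)
        for i in candidates[:num_drop]:
            visible[i] = False
    return accumulated


# Toy image: patches 0-3 belong to the object, patches 4-7 are background.
relevance = [0.9, 0.8, 0.5, 0.4, 0.05, 0.05, 0.05, 0.05]
threshold = 0.1  # assumed cutoff for turning scores into a mask

single_pass = toy_attention(relevance, [True] * len(relevance))
combined = salience_dropout(lambda vis: toy_attention(relevance, vis), len(relevance))

mask_single = [i for i, a in enumerate(single_pass) if a > threshold]
mask_combined = [i for i, a in enumerate(combined) if a > threshold]
print(mask_single, mask_combined)
```

With these toy numbers, a single attention pass selects only the two strongest object patches, while the accumulated map after dropout covers all four object patches and still excludes the background. Dropping too aggressively would instead push attention onto background patches, which is the over-segmentation side of the trade-off the abstract mentions.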

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Semantic segmentation | ADE20K (val) | mIoU 14.2 | 2888 |
| Semantic segmentation | COCO Stuff | mIoU 17.9 | 379 |
| Semantic segmentation | ADE20K | mIoU 14.2 | 366 |
| Semantic segmentation | PC-59 | mIoU 28 | 148 |
| Semantic segmentation | COCO Object | mIoU 36.2 | 129 |
| Semantic segmentation | COCO Stuff (val) | mIoU 17.9 | 126 |
| Semantic segmentation | PASCAL-Context 59 class (val) | mIoU 28 | 125 |
| Semantic segmentation | VOC-20 | mIoU 51.3 | 118 |
| Semantic segmentation | COCO Object (val) | mIoU 0.362 | 97 |
| Open Vocabulary Semantic Segmentation | ADE20K without background | mIoU 14.2 | 72 |
