
In Defense of Lazy Visual Grounding for Open-Vocabulary Semantic Segmentation

About

We present lazy visual grounding, a two-stage approach to open-vocabulary semantic segmentation: unsupervised object mask discovery followed by object grounding. Much of the prior art casts this task as pixel-to-text classification without object-level comprehension, leveraging the image-to-text classification capability of pretrained vision-and-language models. We argue that visual objects are distinguishable without prior text information, since segmentation is essentially a vision task. Lazy visual grounding first discovers object masks covering an image with iterative Normalized cuts, and only later assigns text to the discovered objects in a late-interaction manner. Our model requires no additional training yet shows strong performance on five public datasets: Pascal VOC, Pascal Context, COCO-object, COCO-stuff, and ADE20K. In particular, the visually appealing segmentation results demonstrate the model's ability to localize objects precisely. Paper homepage: https://cvlab.postech.ac.kr/research/lazygrounding
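The two stages described above can be sketched in a minimal, self-contained form. This is an illustrative approximation, not the authors' implementation: real patch features would come from a pretrained vision backbone and the text embeddings from the paired text encoder, while here both are stand-in NumPy arrays. Stage one recursively bipartitions the patch-affinity graph with a Normalized cut (the Shi–Malik spectral relaxation); stage two assigns each discovered mask a class by late interaction, i.e. comparing the mask's pooled feature against text embeddings only after the masks exist. The function names and the fixed iteration count are illustrative choices.

```python
import numpy as np

def normalized_cut_bipartition(features):
    """Split a set of patch features in two via the Normalized-cut relaxation."""
    # Affinity: non-negative cosine similarity between patch features.
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    W = np.clip(f @ f.T, 0.0, None)
    d = W.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d + 1e-8))
    # Symmetric normalized Laplacian; its second eigenvector (Fiedler vector)
    # gives the relaxed two-way Normalized cut (Shi & Malik).
    L = np.eye(len(d)) - D_inv_sqrt @ W @ D_inv_sqrt
    _, vecs = np.linalg.eigh(L)
    fiedler = vecs[:, 1]
    # Threshold at the median to obtain a boolean partition of the patches.
    return fiedler >= np.median(fiedler)

def discover_masks(features, n_iters=3):
    """Stage 1: iteratively cut off object masks; text is not used here."""
    remaining = np.arange(len(features))
    masks = []
    for _ in range(n_iters):
        if len(remaining) < 4:
            break
        part = normalized_cut_bipartition(features[remaining])
        masks.append(remaining[part])
        remaining = remaining[~part]
    if len(remaining):
        masks.append(remaining)  # leftover region becomes the final mask
    return masks

def ground_masks(features, masks, text_embeds, class_names):
    """Stage 2 (late interaction): pool each mask's features, match to text."""
    t = text_embeds / np.linalg.norm(text_embeds, axis=1, keepdims=True)
    labels = []
    for m in masks:
        v = features[m].mean(axis=0)
        v = v / np.linalg.norm(v)
        labels.append(class_names[int(np.argmax(t @ v))])
    return labels
```

Because grounding happens only after mask discovery, the masks partition all patches regardless of the text vocabulary; swapping in a different class list changes only the labels, not the segments.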

Dahyun Kang, Minsu Cho · 2024

Related benchmarks

| Task                  | Dataset                           | Metric | Result | Rank |
|-----------------------|-----------------------------------|--------|--------|------|
| Semantic segmentation | ADE20K (val)                      | mIoU   | 15.8   | 2731 |
| Semantic segmentation | ADE20K                            | mIoU   | 15.8   | 936  |
| Semantic segmentation | Cityscapes                        | mIoU   | 26.2   | 578  |
| Semantic segmentation | Cityscapes (val)                  | mIoU   | 26.2   | 332  |
| Semantic segmentation | COCO Stuff                        | mIoU   | 0.232  | 195  |
| Semantic segmentation | Pascal Context 59                 | mIoU   | 34.7   | 164  |
| Semantic segmentation | COCO Stuff (val)                  | mIoU   | 23.2   | 126  |
| Semantic segmentation | PASCAL-Context 59 class (val)     | mIoU   | 34.7   | 125  |
| Semantic segmentation | Pascal VOC 20                     | mIoU   | 82.5   | 105  |
| Semantic segmentation | Pascal VOC 21 classes (val)       | mIoU   | 62.1   | 103  |

Showing 10 of 31 rows
