Learning to Generate Text-grounded Mask for Open-world Semantic Segmentation from Only Image-Text Pairs

About

We tackle open-world semantic segmentation, which aims to learn to segment arbitrary visual concepts in images using only image-text pairs, without dense annotations. Existing open-world segmentation methods have shown impressive advances by employing contrastive learning (CL) to learn diverse visual concepts and transferring the learned image-level understanding to the segmentation task. However, these CL-based methods suffer from a train-test discrepancy, since they consider only image-text alignment during training, whereas segmentation requires region-text alignment during testing. In this paper, we propose a novel Text-grounded Contrastive Learning (TCL) framework that enables a model to directly learn region-text alignment. Our method generates a segmentation mask for a given text, extracts a text-grounded image embedding from the masked region, and aligns it with the text embedding via TCL. By learning region-text alignment directly, our framework encourages a model to directly improve the quality of the generated segmentation masks. In addition, for a rigorous and fair comparison, we present a unified evaluation protocol with eight widely used semantic segmentation datasets. TCL achieves state-of-the-art zero-shot segmentation performance by large margins on all datasets. Code is available at https://github.com/kakaobrain/tcl.
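The three steps named in the abstract (generate a mask for a text, pool an image embedding from the masked region, align it with the text embedding) can be illustrated with a small pure-Python sketch. This is not the authors' implementation, which lives in the linked repository. Everything here is an assumption made for illustration: the function names, the mask defined as a sigmoid of scaled patch-text cosine similarity, the mask-weighted average pooling, the symmetric InfoNCE objective, and the `scale` and `temperature` values.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    # L2-normalize a vector; the fallback avoids division by zero.
    n = math.sqrt(dot(v, v)) or 1.0
    return [x / n for x in v]

def text_grounded_mask(patches, text, scale=10.0):
    # Assumed mask generator: sigmoid of scaled patch-text cosine similarity.
    t = normalize(text)
    return [1.0 / (1.0 + math.exp(-scale * dot(normalize(p), t))) for p in patches]

def grounded_embedding(patches, mask):
    # Assumed pooling: mask-weighted average of the patch embeddings.
    total = sum(mask)
    dim = len(patches[0])
    return [sum(m * p[d] for m, p in zip(mask, patches)) / total for d in range(dim)]

def info_nce(sim):
    # Mean cross-entropy over rows, with the diagonal entry as the positive.
    loss = 0.0
    for i, row in enumerate(sim):
        mx = max(row)
        log_sum_exp = mx + math.log(sum(math.exp(x - mx) for x in row))
        loss += log_sum_exp - row[i]
    return loss / len(sim)

def tcl_loss(batch_patches, batch_texts, temperature=0.1):
    # Pair (i, j): mask image i with text j, pool, then compare with text j.
    n = len(batch_texts)
    sim = [[0.0] * n for _ in range(n)]
    for i, patches in enumerate(batch_patches):
        for j, text in enumerate(batch_texts):
            mask = text_grounded_mask(patches, text)
            emb = grounded_embedding(patches, mask)
            sim[i][j] = dot(normalize(emb), normalize(text)) / temperature
    sim_t = [list(col) for col in zip(*sim)]
    # Symmetric loss: image-to-text and text-to-image directions.
    return 0.5 * (info_nce(sim) + info_nce(sim_t))

# Toy batch: two images with two 2-D patch embeddings each, two matching texts.
images = [[[1.0, 0.0], [0.9, 0.1]], [[0.0, 1.0], [0.1, 0.9]]]
texts = [[1.0, 0.0], [0.0, 1.0]]
print(tcl_loss(images, texts))
```

Because the pooled embedding depends on the mask, the loss is a function of the mask itself, which is the sense in which region-text alignment is learned directly in this sketch. A real model would produce the patch and text embeddings with trained encoders and backpropagate through the mask generator.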

Junbum Cha, Jonghwan Mun, Byungseok Roh • 2022

Related benchmarks

Task                   Dataset                 Metric    Result  Rank
Semantic segmentation  ADE20K (val)            mIoU      17.1    2731
Semantic segmentation  PASCAL VOC 2012 (val)   Mean IoU  55      2040
Semantic segmentation  Cityscapes (test)       mIoU      23.1    1145
Semantic segmentation  ADE20K                  mIoU      17.1    936
Semantic segmentation  Cityscapes              mIoU      24      578
Semantic segmentation  Cityscapes (val)        mIoU      23.1    572
Semantic segmentation  PASCAL VOC (val)        mIoU      77.5    338
Semantic segmentation  Cityscapes (val)        mIoU      24      332
Semantic segmentation  PASCAL Context (val)    mIoU      33.8    323
Semantic segmentation  Pascal VOC (test)       mIoU      77.5    236
Showing 10 of 92 rows