Open-Vocabulary Universal Image Segmentation with MaskCLIP

About

In this paper, we tackle an emerging computer vision task, open-vocabulary universal image segmentation, which aims to perform semantic/instance/panoptic segmentation (background semantic labeling + foreground instance segmentation) for arbitrary categories given as text-based descriptions at inference time. We first build a baseline method by directly adopting pre-trained CLIP models without finetuning or distillation. We then develop MaskCLIP, a Transformer-based approach with a MaskCLIP Visual Encoder, an encoder-only module that seamlessly integrates mask tokens with a pre-trained ViT CLIP model for semantic/instance segmentation and class prediction. MaskCLIP learns to efficiently and effectively utilize pre-trained partial/dense CLIP features within the MaskCLIP Visual Encoder, which avoids the time-consuming student-teacher training process. MaskCLIP outperforms previous methods for semantic/instance/panoptic segmentation on the ADE20K and PASCAL datasets. We show qualitative illustrations for MaskCLIP with online custom categories. Project website: https://maskclip.github.io.

Zheng Ding, Jieke Wang, Zhuowen Tu • 2022
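
To make the open-vocabulary idea in the abstract concrete, the sketch below shows one common way CLIP-style embeddings are used for zero-shot classification of mask regions: each region embedding is compared to the text embedding of every candidate category by cosine similarity, and a softmax over the scaled similarities gives class probabilities. This is an illustrative sketch and not the paper's implementation. The function names (`cosine`, `classify_masks`), the `temperature` value, and the three-dimensional toy vectors are all assumptions; in practice the embeddings would come from a pre-trained CLIP image and text encoder.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def classify_masks(mask_embeddings, text_embeddings, labels, temperature=0.01):
    """Assign each mask embedding the label whose text embedding is most similar.

    `temperature` is a hypothetical scaling value chosen for illustration.
    Returns a list of (label, probability) pairs, one per mask.
    """
    results = []
    for mask_vec in mask_embeddings:
        logits = [cosine(mask_vec, text_vec) / temperature for text_vec in text_embeddings]
        # Numerically stable softmax over the candidate categories.
        max_logit = max(logits)
        exps = [math.exp(logit - max_logit) for logit in logits]
        total = sum(exps)
        probs = [e / total for e in exps]
        best = max(range(len(labels)), key=probs.__getitem__)
        results.append((labels[best], probs[best]))
    return results


# Toy stand-ins for CLIP embeddings (assumed values, not real model outputs).
labels = ["cat", "dog", "sky"]
text_embeddings = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
mask_embeddings = [[0.9, 0.1, 0.0], [0.0, 0.2, 0.8]]

predictions = classify_masks(mask_embeddings, text_embeddings, labels)
predicted_labels = [label for label, _ in predictions]
```

Because the category list is just a set of text embeddings, it can be changed at inference time without retraining, which is what makes the setting open-vocabulary.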

Related benchmarks

Task	Dataset	Result
Semantic segmentation	ADE20K	mIoU 23.7	1028
Semantic segmentation	Cityscapes	mIoU 17.7	668
Semantic segmentation	COCO Stuff	mIoU 8.8	399
Semantic segmentation	Pascal VOC	mIoU 0.388	280
Semantic segmentation	ADE20K A-150	mIoU 23.7	224
Semantic segmentation	Pascal Context 59	mIoU 45.9	204
Semantic segmentation	PC-59	mIoU 45.9	174
Object Detection	LVIS (val)	mAP 8.4	170
Semantic segmentation	COCO Object	mIoU 20.6	139
Semantic segmentation	Pascal VOC 20	mIoU 41.7	130

Showing 10 of 66 rows
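
Most rows above report mIoU (mean intersection over union). As a reminder of what that metric measures, here is a minimal sketch that computes it for a single pair of flattened label maps. The function name `mean_iou` and the toy label lists are assumptions for illustration; benchmark evaluations typically accumulate intersections and unions over a whole dataset before averaging, which this sketch does not do.

```python
def mean_iou(pred, target, num_classes):
    """Mean intersection-over-union across classes for two flat label lists.

    Classes absent from both prediction and target are skipped so they do
    not distort the average.
    """
    ious = []
    for cls in range(num_classes):
        intersection = sum(1 for p, t in zip(pred, target) if p == cls and t == cls)
        union = sum(1 for p, t in zip(pred, target) if p == cls or t == cls)
        if union > 0:
            ious.append(intersection / union)
    return sum(ious) / len(ious) if ious else 0.0


# Toy per-pixel labels (assumed values): three classes over six pixels.
pred_labels = [0, 0, 1, 1, 2, 2]
true_labels = [0, 1, 1, 1, 2, 0]

score = mean_iou(pred_labels, true_labels, num_classes=3)
```

The function returns a fraction between 0 and 1; leaderboards often report the same quantity multiplied by 100.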
