
CLIP as RNN: Segment Countless Visual Concepts without Training Endeavor

About

Existing open-vocabulary image segmentation methods require a fine-tuning step on mask labels and/or image-text datasets. Mask labels are labor-intensive to collect, which limits the number of categories in segmentation datasets. Consequently, the vocabulary capacity of pre-trained VLMs is severely reduced after fine-tuning. However, without fine-tuning, VLMs trained under weak image-text supervision tend to make suboptimal mask predictions. To alleviate these issues, we introduce a novel recurrent framework that progressively filters out irrelevant texts and enhances mask quality without any training effort. The recurrent unit is a two-stage segmenter built upon a frozen VLM. Thus, our model retains the VLM's broad vocabulary space and equips it with segmentation ability. Experiments show that our method outperforms not only its training-free counterparts, but also those fine-tuned with millions of data samples, and sets new state-of-the-art records for both zero-shot semantic and referring segmentation. Concretely, we improve the current records by 28.8, 16.0, and 6.9 mIoU on Pascal VOC, COCO Object, and Pascal Context, respectively.
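The recurrent unit described above can be sketched as a simple propose-then-filter loop: each iteration runs the two-stage segmenter (propose a mask per text, then score each text against its mask) and drops low-scoring texts, repeating until the text set is stable. This is a minimal, runnable toy sketch; the mask proposer and scorer below are stand-ins I invented for illustration, not the paper's actual CLIP-based components.

```python
# Toy sketch of the recurrent framework: progressively filter out
# irrelevant texts using a two-stage segmenter on a frozen model.
# `propose_mask` and `score_text` are illustrative stand-ins (assumptions).

def propose_mask(image, text):
    # Stage 1 (toy): mark each "pixel" whose label matches the text.
    # A real system would derive masks from the frozen VLM's attention.
    return [px == text for px in image]

def score_text(image, mask, text):
    # Stage 2 (toy): score a text by the fraction of pixels its mask covers.
    # The paper instead scores masked regions with the frozen VLM.
    return sum(mask) / len(image)

def segment_recurrent(image, texts, threshold=0.2, max_iters=10):
    """Repeat propose-then-filter until no more texts are removed."""
    masks = {}
    for _ in range(max_iters):
        masks = {t: propose_mask(image, t) for t in texts}
        scores = {t: score_text(image, masks[t], t) for t in texts}
        kept = [t for t in texts if scores[t] >= threshold]
        if kept == texts:  # text set is stable -> converged
            break
        texts = kept
    return {t: masks[t] for t in texts}

# Example: a 10-"pixel" image of "cat" and "sky"; "dog" is irrelevant
# and is filtered out by the recurrence.
image = ["cat"] * 4 + ["sky"] * 6
result = segment_recurrent(image, ["cat", "dog", "sky"])
print(sorted(result))  # ['cat', 'sky']
```

Because the segmenter is a fixed unit applied repeatedly to its own filtered output, the whole loop behaves like an RNN over the text set, which is where the paper's title comes from.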

Shuyang Sun, Runjia Li, Philip Torr, Xiuye Gu, Siyang Li • 2023

Related benchmarks

Task                                Dataset                          Metric  Result  Rank
Semantic segmentation               ADE20K (val)                     mIoU    5.4     2731
Semantic segmentation               ADE20K                           mIoU    17.7    936
Semantic segmentation               Pascal Context 59                mIoU    39.5    164
Semantic segmentation               PASCAL-Context 59 class (val)    mIoU    18.4    125
Referring Expression Segmentation   RefCOCOg (val)                   cIoU    36.6    107
Semantic segmentation               Pascal VOC 20                    mIoU    91.4    105
Semantic segmentation               Pascal VOC 21 classes (val)      mIoU    67.6    103
Semantic segmentation               Pascal Context 60                mIoU    30.5    81
Semantic segmentation               COCO Object (val)                mIoU    0.154   77
Semantic segmentation               COCO Object                      mIoU    36.6    73

Showing 10 of 21 rows
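The benchmarks above mostly report mIoU (mean Intersection-over-Union), the standard semantic segmentation metric: per-class IoU averaged over the class set. A minimal sketch of the computation, with illustrative class names and label maps:

```python
# Minimal mIoU computation over flat label maps (illustrative example).

def miou(pred, gt, classes):
    """Average per-class intersection-over-union for the given classes."""
    ious = []
    for c in classes:
        inter = sum(1 for p, g in zip(pred, gt) if p == c and g == c)
        union = sum(1 for p, g in zip(pred, gt) if p == c or g == c)
        if union:  # skip classes absent from both prediction and ground truth
            ious.append(inter / union)
    return sum(ious) / len(ious)

pred = ["cat", "cat", "sky", "sky"]
gt   = ["cat", "sky", "sky", "sky"]
# cat: IoU = 1/2, sky: IoU = 2/3, mean = 0.583 (rounded)
print(round(miou(pred, gt, ["cat", "sky"]), 3))  # 0.583
```

cIoU (cumulative IoU, used for RefCOCOg) instead sums intersections and unions over the whole dataset before dividing, so large objects weigh more.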
