CLIP as RNN: Segment Countless Visual Concepts without Training Endeavor
About
Existing open-vocabulary image segmentation methods require a fine-tuning step on mask labels and/or image-text datasets. Mask labels are labor-intensive to collect, which limits the number of categories in segmentation datasets. Consequently, the vocabulary capacity of pre-trained VLMs is severely reduced after fine-tuning. However, without fine-tuning, VLMs trained under weak image-text supervision tend to make suboptimal mask predictions. To alleviate these issues, we introduce a novel recurrent framework that progressively filters out irrelevant texts and enhances mask quality without any training effort. The recurrent unit is a two-stage segmenter built upon a frozen VLM. Thus, our model retains the VLM's broad vocabulary space while equipping it with segmentation ability. Experiments show that our method outperforms not only its training-free counterparts, but also those fine-tuned with millions of data samples, and sets new state-of-the-art records for both zero-shot semantic and referring segmentation. Concretely, we improve the current record by 28.8, 16.0, and 6.9 mIoU on Pascal VOC, COCO Object, and Pascal Context.
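The recurrent idea above can be illustrated as a simple fixed-point loop: score the current text queries with the frozen VLM, drop low-confidence ones, and repeat until the query set stops shrinking. The sketch below is a minimal illustration under assumed names — `score_queries` stands in for the frozen two-stage segmenter, and the threshold and toy scores are placeholders, not the paper's actual implementation.

```python
def recurrent_filter(queries, score_queries, threshold=0.5, max_iters=10):
    """Iteratively drop text queries whose mask confidence is low.

    Each pass re-scores the surviving queries (the frozen VLM can assign
    different confidences once distractor texts are removed) and filters
    again, until no query is removed or the iteration budget runs out.
    """
    active = list(queries)
    for _ in range(max_iters):
        scores = score_queries(active)              # one pass of the (frozen) segmenter
        kept = [q for q, s in zip(active, scores) if s >= threshold]
        if len(kept) == len(active):                # converged: nothing filtered out
            break
        active = kept
    return active


# Toy stand-in for the VLM: confidence rises as distractor texts are removed.
def toy_scores(queries):
    base = {"cat": 0.9, "dog": 0.6, "unicorn": 0.1, "dragon": 0.3}
    bonus = 0.05 * (4 - len(queries))               # fewer distractors -> higher scores
    return [base[q] + bonus for q in queries]


surviving = recurrent_filter(["cat", "dog", "unicorn", "dragon"], toy_scores)
```

With the toy scorer, the irrelevant queries ("unicorn", "dragon") are filtered in the first pass and the loop converges on the remaining two; in the real framework each pass would also yield refined masks for the surviving texts.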
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Semantic segmentation | ADE20K (val) | mIoU | 5.4 | 2731 |
| Semantic segmentation | ADE20K | mIoU | 17.7 | 936 |
| Semantic segmentation | Pascal Context 59 | mIoU | 39.5 | 164 |
| Semantic segmentation | PASCAL-Context 59 class (val) | mIoU | 18.4 | 125 |
| Referring Expression Segmentation | RefCOCOg (val) | cIoU | 36.6 | 107 |
| Semantic segmentation | Pascal VOC 20 | mIoU | 91.4 | 105 |
| Semantic segmentation | Pascal VOC 21 classes (val) | mIoU | 67.6 | 103 |
| Semantic segmentation | Pascal Context 60 | mIoU | 30.5 | 81 |
| Semantic segmentation | COCO Object (val) | mIoU | 0.154 | 77 |
| Semantic segmentation | COCO Object | mIoU | 36.6 | 73 |