ReCo: Retrieve and Co-segment for Zero-shot Transfer
About
Semantic segmentation has a broad range of applications, but its real-world impact has been significantly limited by the prohibitive annotation costs necessary to enable deployment. Segmentation methods that forgo supervision can side-step these costs, but they still require labelled examples from the target distribution to assign concept names to predictions. An alternative line of work in language-image pre-training has recently demonstrated the potential to produce models that can both assign names across large vocabularies of concepts and enable zero-shot transfer for classification, but these models do not demonstrate commensurate segmentation abilities. In this work, we strive to achieve a synthesis of these two approaches that combines their strengths. We leverage the retrieval abilities of one such language-image pre-trained model, CLIP, to dynamically curate training sets from unlabelled images for arbitrary collections of concept names, and leverage the robust correspondences offered by modern image representations to co-segment entities among the resulting collections. The synthetic segment collections are then employed to construct a segmentation model (without requiring pixel labels) whose knowledge of concepts is inherited from the scalable pre-training process of CLIP. We demonstrate that our approach, termed Retrieve and Co-segment (ReCo), performs favourably relative to unsupervised segmentation approaches while inheriting the convenience of nameable predictions and zero-shot transfer. We also demonstrate ReCo's ability to generate specialist segmenters for extremely rare objects.
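The retrieval step described above can be illustrated with a minimal sketch. It assumes that image and text embeddings have already been computed by a language-image model such as CLIP and simply ranks images by cosine similarity to a concept's text embedding. The function name `retrieve_top_k`, the embedding size, and the random stand-in data are hypothetical and are not taken from the ReCo implementation; the co-segmentation step is not shown.

```python
import numpy as np

def retrieve_top_k(text_embedding, image_embeddings, k):
    """Return the indices and scores of the k images closest to a text embedding.

    Hypothetical helper: embeddings are assumed to be precomputed by a
    language-image model; similarity is plain cosine similarity.
    """
    # L2-normalise so that the dot product equals cosine similarity.
    text = text_embedding / np.linalg.norm(text_embedding)
    images = image_embeddings / np.linalg.norm(
        image_embeddings, axis=1, keepdims=True
    )
    similarity = images @ text
    # Sort in descending order of similarity and keep the first k.
    order = np.argsort(-similarity)[:k]
    return order, similarity[order]

# Toy stand-in data (assumption): 100 unlabelled "images" with 16-d embeddings.
rng = np.random.default_rng(0)
image_embeddings = rng.normal(size=(100, 16))
# Pretend the concept name embeds very close to image 7.
text_embedding = image_embeddings[7] + 0.01 * rng.normal(size=16)

indices, scores = retrieve_top_k(text_embedding, image_embeddings, k=5)
print(indices, scores)
```

In this sketch the retrieved indices would form the curated image collection for one concept name; repeating the call for each name in a vocabulary would give one collection per concept.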
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Semantic segmentation | ADE20K (val) | mIoU 11.2 | 2888 |
| Semantic segmentation | PASCAL VOC 2012 (val) | Mean IoU 25.1 | 2142 |
| Semantic segmentation | ADE20K | mIoU 11.2 | 1024 |
| Semantic segmentation | Cityscapes | mIoU 21.1 | 658 |
| Semantic segmentation | Cityscapes (val) | mIoU 19.3 | 572 |
| Semantic segmentation | COCO Stuff | mIoU 2.63e+3 | 379 |
| Semantic segmentation | Cityscapes (val) | mIoU 21.6 | 374 |
| Semantic segmentation | ADE20K | mIoU 11.2 | 366 |
| Semantic segmentation | PASCAL VOC (val) | mIoU 55.2 | 362 |
| Semantic segmentation | PASCAL Context (val) | mIoU 26.2 | 360 |