CRIS: CLIP-Driven Referring Image Segmentation
About
Referring image segmentation aims to segment a referent via a natural linguistic expression. Due to the distinct data properties of text and image, it is challenging for a network to align text and pixel-level features well. Existing approaches use pretrained models to facilitate learning, yet transfer the language and vision knowledge from pretrained models separately, ignoring the multi-modal correspondence information. Inspired by the recent advance in Contrastive Language-Image Pretraining (CLIP), in this paper, we propose an end-to-end CLIP-Driven Referring Image Segmentation framework (CRIS). To transfer the multi-modal knowledge effectively, CRIS resorts to vision-language decoding and contrastive learning to achieve text-to-pixel alignment. More specifically, we design a vision-language decoder to propagate fine-grained semantic information from textual representations to each pixel-level activation, which promotes consistency between the two modalities. In addition, we present text-to-pixel contrastive learning to explicitly enforce the text feature to be similar to the related pixel-level features and dissimilar to the irrelevant ones. The experimental results on three benchmark datasets demonstrate that our proposed framework significantly outperforms state-of-the-art methods without any post-processing. The code will be released.
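
The text-to-pixel contrastive objective described above can be illustrated with a small sketch. The abstract does not give the exact loss, so this assumes one plausible form: the dot product between the text feature and each pixel feature is passed through a sigmoid, and a binary cross-entropy term pulls pixels inside the referred region toward the text feature and pushes all other pixels away. The function names, the plain-list data layout, and the `eps` clamp are illustrative choices, not taken from the CRIS code.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def text_to_pixel_contrastive_loss(text_feat, pixel_feats, mask, eps=1e-7):
    """Assumed form of a text-to-pixel contrastive loss (a sketch only).

    text_feat:   list of D floats, one sentence-level text feature.
    pixel_feats: list of N pixel features, each a list of D floats.
    mask:        list of N labels, 1 for pixels of the referent, 0 otherwise.
    """
    total = 0.0
    for feat, label in zip(pixel_feats, mask):
        # Similarity between the text feature and this pixel feature.
        score = sum(t * p for t, p in zip(text_feat, feat))
        # Clamp the probability so the logarithm stays finite.
        prob = min(max(sigmoid(score), eps), 1.0 - eps)
        if label == 1:
            total += -math.log(prob)        # related pixel: pull together
        else:
            total += -math.log(1.0 - prob)  # irrelevant pixel: push apart
    return total / len(pixel_feats)


# Toy example: two pixels, one aligned with the text feature and one opposed.
text = [1.0, 0.0]
pixels = [[5.0, 0.0], [-5.0, 0.0]]
aligned = text_to_pixel_contrastive_loss(text, pixels, [1, 0])
flipped = text_to_pixel_contrastive_loss(text, pixels, [0, 1])
```

With the correct mask the loss is small, and swapping the labels makes it large, which is the behavior the abstract describes: related pixels score high against the text feature, irrelevant ones score low.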
Related benchmarks
| Task | Dataset | Metric | Score | Rank |
|---|---|---|---|---|
| Referring Expression Segmentation | RefCOCO (testA) | cIoU | 73.2 | 217 |
| Referring Expression Segmentation | RefCOCO+ (val) | cIoU | 65.3 | 201 |
| Referring Image Segmentation | RefCOCO+ (test-B) | mIoU | 53.7 | 200 |
| Referring Image Segmentation | RefCOCO (val) | mIoU | 70.47 | 197 |
| Referring Expression Segmentation | RefCOCO (testB) | cIoU | 66.1 | 191 |
| Referring Expression Segmentation | RefCOCO+ (testA) | cIoU | 68.1 | 190 |
| Referring Expression Segmentation | RefCOCO (val) | cIoU | 70.5 | 190 |
| Referring Expression Segmentation | RefCOCO+ (testB) | cIoU | 53.7 | 188 |
| Referring Image Segmentation | RefCOCO (test A) | mIoU | 73.2 | 178 |
| Medical Image Segmentation | BUSI (test) | Dice | 67.5 | 121 |
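
The table reports cIoU, mIoU, and Dice without defining them. The sketch below assumes the conventions commonly used for segmentation benchmarks: mIoU averages the per-sample IoU, cIoU (cumulative IoU) divides the total intersection by the total union over the whole dataset, and Dice is twice the intersection over the sum of the two mask sizes. Masks are flat lists of 0/1 values, and all function names are illustrative.

```python
def _intersection_union(pred, target):
    """Count intersecting and united foreground pixels of two 0/1 masks."""
    inter = sum(1 for p, t in zip(pred, target) if p and t)
    union = sum(1 for p, t in zip(pred, target) if p or t)
    return inter, union


def mean_iou(preds, targets):
    """mIoU: average of the IoU computed separately for each sample."""
    ious = []
    for pred, target in zip(preds, targets):
        inter, union = _intersection_union(pred, target)
        # Two empty masks agree perfectly, so score them as 1.0.
        ious.append(inter / union if union else 1.0)
    return sum(ious) / len(ious)


def cumulative_iou(preds, targets):
    """cIoU: total intersection over total union across all samples."""
    total_inter = 0
    total_union = 0
    for pred, target in zip(preds, targets):
        inter, union = _intersection_union(pred, target)
        total_inter += inter
        total_union += union
    return total_inter / total_union if total_union else 1.0


def dice(pred, target):
    """Dice coefficient of two 0/1 masks."""
    inter, _ = _intersection_union(pred, target)
    size = sum(1 for p in pred if p) + sum(1 for t in target if t)
    return 2 * inter / size if size else 1.0


# Toy dataset of two samples with four pixels each.
preds = [[1, 1, 0, 0], [1, 0, 0, 0]]
targets = [[1, 0, 0, 0], [1, 1, 1, 1]]
m = mean_iou(preds, targets)        # (1/2 + 1/4) / 2
c = cumulative_iou(preds, targets)  # (1 + 1) / (2 + 4)
d = dice(preds[0], targets[0])      # 2 * 1 / (2 + 1)
```

Under these assumed definitions the two IoU variants can differ on the same predictions, because cumulative IoU weights samples with large objects more heavily than mean IoU does.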