Zero-shot Referring Image Segmentation with Global-Local Context Features
About
Referring image segmentation (RIS) aims to find a segmentation mask given a referring expression grounded to a region of the input image. Collecting labelled datasets for this task, however, is notoriously costly and labor-intensive. To overcome this issue, we propose a simple yet effective zero-shot referring image segmentation method that leverages the pre-trained cross-modal knowledge of CLIP. To obtain segmentation masks grounded to the input text, we propose a mask-guided visual encoder that captures global and local contextual information of an input image. By utilizing instance masks obtained from off-the-shelf mask proposal techniques, our method is able to segment fine-detailed instance-level groundings. We also introduce a global-local text encoder, where the global feature captures complex sentence-level semantics of the entire input expression while the local feature focuses on the target noun phrase extracted by a dependency parser. In our experiments, the proposed method outperforms several zero-shot baselines for the task, and even a weakly supervised referring expression segmentation method, by substantial margins. Our code is available at https://github.com/Seonghoon-Yu/Zero-shot-RIS.
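The selection step implied by the abstract can be sketched as follows: each mask proposal gets a visual feature that mixes a global and a local view, the expression gets a text feature that mixes the whole sentence and the target noun phrase, and the proposal most similar to the text is returned. This is only an illustrative sketch, not the authors' implementation. It assumes the features have already been extracted (for example by CLIP), and the function names, the convex-combination blending, and the weights `alpha` and `beta` are all hypothetical choices made for the example.

```python
import math
from typing import List, Sequence


def l2_normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def blend(global_feat: Sequence[float], local_feat: Sequence[float], weight: float) -> List[float]:
    """Convex combination of normalized global and local features (assumed blending rule)."""
    g = l2_normalize(global_feat)
    l = l2_normalize(local_feat)
    return l2_normalize([weight * a + (1.0 - weight) * b for a, b in zip(g, l)])


def select_mask(
    visual_global: Sequence[Sequence[float]],  # one global-context feature per mask proposal
    visual_local: Sequence[Sequence[float]],   # one local (masked-region) feature per proposal
    text_global: Sequence[float],              # feature of the full referring expression
    text_local: Sequence[float],               # feature of the target noun phrase
    alpha: float = 0.5,                        # hypothetical visual blending weight
    beta: float = 0.5,                         # hypothetical text blending weight
) -> int:
    """Return the index of the mask proposal whose blended feature best matches the text."""
    text = blend(text_global, text_local, beta)
    scores = []
    for vg, vl in zip(visual_global, visual_local):
        visual = blend(vg, vl, alpha)
        scores.append(sum(a * b for a, b in zip(visual, text)))
    return max(range(len(scores)), key=scores.__getitem__)
```

In use, `visual_global` and `visual_local` would hold one feature per instance mask from an off-the-shelf proposal network, and the returned index picks the mask to output as the segmentation.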
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Referring Expression Comprehension | RefCOCO+ (val) | -- | -- | 354 |
| Referring Image Segmentation | RefCOCO (val) | mIoU | 48.77 | 259 |
| Referring Expression Segmentation | RefCOCO (testA) | cIoU | 35.3 | 257 |
| Referring Image Segmentation | RefCOCO+ (test-B) | mIoU | 35.34 | 252 |
| Referring Image Segmentation | RefCOCO (test A) | mIoU | 55 | 230 |
| Referring Expression Segmentation | RefCOCO+ (testA) | cIoU | 24.9 | 230 |
| Referring Expression Segmentation | RefCOCO+ (val) | cIoU | 26.2 | 223 |
| Referring Expression Segmentation | RefCOCO (testB) | cIoU | 24.7 | 213 |
| Referring Expression Segmentation | RefCOCO (val) | cIoU | 24.9 | 212 |
| Referring Expression Segmentation | RefCOCO+ (testB) | cIoU | 25.8 | 210 |
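The two metrics in the table differ in how they aggregate over samples: mIoU averages the per-sample intersection-over-union, while cIoU (cumulative IoU, sometimes called overall IoU) divides the total intersection by the total union across the dataset, so large objects weigh more. The sketch below illustrates the difference on binary masks stored as flat lists of 0/1. The function names are hypothetical, scoring an empty union as 0 is an assumed convention, and the outputs are fractions in [0, 1] rather than the percentage-style values listed above.

```python
from typing import List, Sequence, Tuple


def intersection_union(pred: Sequence[int], target: Sequence[int]) -> Tuple[int, int]:
    """Count pixels in the intersection and in the union of two binary masks."""
    inter = sum(1 for p, t in zip(pred, target) if p and t)
    union = sum(1 for p, t in zip(pred, target) if p or t)
    return inter, union


def mean_iou(preds: Sequence[Sequence[int]], targets: Sequence[Sequence[int]]) -> float:
    """mIoU: average of the per-sample IoU values."""
    ious: List[float] = []
    for pred, target in zip(preds, targets):
        inter, union = intersection_union(pred, target)
        # Assumed convention: a sample with an empty union scores 0.
        ious.append(inter / union if union else 0.0)
    return sum(ious) / len(ious)


def cumulative_iou(preds: Sequence[Sequence[int]], targets: Sequence[Sequence[int]]) -> float:
    """cIoU: total intersection over total union, accumulated over all samples."""
    total_inter = 0
    total_union = 0
    for pred, target in zip(preds, targets):
        inter, union = intersection_union(pred, target)
        total_inter += inter
        total_union += union
    return total_inter / total_union if total_union else 0.0


if __name__ == "__main__":
    preds = [[1, 1, 0, 0], [1, 0, 0, 0]]
    targets = [[1, 0, 0, 0], [1, 0, 0, 0]]
    print(mean_iou(preds, targets))        # per-sample IoUs are 0.5 and 1.0
    print(cumulative_iou(preds, targets))  # 2 intersecting pixels over 3 union pixels
```

Because the two aggregations can diverge noticeably on the same predictions, mIoU and cIoU rows in the table are not directly comparable with each other.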