CM-MaskSD: Cross-Modality Masked Self-Distillation for Referring Image Segmentation
About
Referring image segmentation (RIS) is a fundamental vision-language task that aims to segment a target object from an image based on a given natural language expression. Because images and text have fundamentally different data properties, most existing methods either introduce complex designs for fine-grained vision-language alignment or lack the required dense alignment, resulting in scalability issues or mis-segmentation problems such as over- or under-segmentation. To achieve effective and efficient fine-grained feature alignment in the RIS task, we explore the potential of masked multimodal modeling coupled with self-distillation and propose a novel cross-modality masked self-distillation framework named CM-MaskSD, which inherits the transferred knowledge of image-text semantic alignment from the CLIP model to realize fine-grained patch-word feature alignment for better segmentation accuracy. Moreover, our CM-MaskSD framework can considerably boost model performance in a nearly parameter-free manner, since it shares weights between the main segmentation branch and the introduced masked self-distillation branches and introduces only negligible parameters for coordinating the multimodal features. Comprehensive experiments on three benchmark datasets (i.e., RefCOCO, RefCOCO+, and G-Ref) for the RIS task demonstrate the superiority of our proposed framework over previous state-of-the-art methods.
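The weight-sharing idea in the abstract can be illustrated with a toy example. The following is a minimal sketch of generic masked self-distillation, not the CM-MaskSD architecture: the per-token linear `encode` function, the random masking in `mask_tokens`, the mean-squared-error `distill_loss`, and all sizes are assumptions made for illustration. The abstract does not describe the actual masking strategy or loss.

```python
import random

def encode(tokens, weights):
    """Toy shared encoder (assumption): one linear map applied to every token."""
    return [[sum(w * x for w, x in zip(row, tok)) for row in weights]
            for tok in tokens]

def mask_tokens(tokens, ratio, rng):
    """Zero out a random subset of tokens (random masking is an assumption).

    Returns the masked copy and the set of masked indices.
    """
    n_mask = int(len(tokens) * ratio)
    masked_idx = set(rng.sample(range(len(tokens)), n_mask))
    masked = [[0.0] * len(tok) if i in masked_idx else list(tok)
              for i, tok in enumerate(tokens)]
    return masked, masked_idx

def distill_loss(student, teacher):
    """Mean squared error between masked-branch and main-branch features."""
    total, count = 0.0, 0
    for s_tok, t_tok in zip(student, teacher):
        for s, t in zip(s_tok, t_tok):
            total += (s - t) ** 2
            count += 1
    return total / count

rng = random.Random(0)
dim = 4
# A single weight matrix serves both branches, so the masked branch
# adds no parameters of its own.
weights = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(dim)]
patches = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(8)]

teacher = encode(patches, weights)            # main branch: full input
masked, idx = mask_tokens(patches, 0.5, rng)  # hide half of the tokens
student = encode(masked, weights)             # masked branch: same weights
loss = distill_loss(student, teacher)
print(f"masked {len(idx)} of {len(patches)} tokens, distillation loss = {loss:.4f}")
```

In this toy encoder each token is processed independently, so the loss comes only from the masked positions. In a real transformer-style encoder, attention mixes tokens, so the masked branch has to infer the hidden content from context, and the main-branch features are typically treated as a fixed target during the update.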
Related benchmarks
| Task | Dataset | Metric | Score | Rank |
|---|---|---|---|---|
| Referring Expression Segmentation | RefCOCO (testA) | cIoU | 75.21 | 217 |
| Referring Expression Segmentation | RefCOCO+ (val) | cIoU | 64.47 | 201 |
| Referring Image Segmentation | RefCOCO+ (test-B) | mIoU | 56.55 | 200 |
| Referring Image Segmentation | RefCOCO (val) | mIoU | 72.18 | 197 |
| Referring Expression Segmentation | RefCOCO (testB) | cIoU | 67.91 | 191 |
| Referring Expression Segmentation | RefCOCO+ (testA) | cIoU | 69.29 | 190 |
| Referring Expression Segmentation | RefCOCO (val) | cIoU | 72.18 | 190 |
| Referring Expression Segmentation | RefCOCO+ (testB) | cIoU | 56.55 | 188 |
| Referring Image Segmentation | RefCOCO (test A) | mIoU | 75.21 | 178 |
| Referring Image Segmentation | RefCOCO (test-B) | mIoU | 67.91 | 119 |
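
The results above are labeled with two metrics, cIoU and mIoU. As a minimal sketch under the commonly used definitions, mIoU averages the per-sample IoU, while cIoU (cumulative IoU) divides the total intersection by the total union over all samples. The function names, the flattened 0/1 mask representation, and the convention of scoring an empty union as 1.0 are assumptions for illustration; this page does not state which convention the reported numbers follow.

```python
def iou_counts(pred, target):
    """Return (intersection, union) pixel counts for two flattened 0/1 masks."""
    inter = sum(1 for p, t in zip(pred, target) if p and t)
    union = sum(1 for p, t in zip(pred, target) if p or t)
    return inter, union

def mean_iou(preds, targets):
    """mIoU: average of per-sample IoU, so every sample counts equally."""
    scores = []
    for pred, target in zip(preds, targets):
        inter, union = iou_counts(pred, target)
        # Assumption: an empty prediction on an empty target scores 1.0.
        scores.append(inter / union if union else 1.0)
    return sum(scores) / len(scores)

def cumulative_iou(preds, targets):
    """cIoU: total intersection over total union, so large objects dominate."""
    total_inter = total_union = 0
    for pred, target in zip(preds, targets):
        inter, union = iou_counts(pred, target)
        total_inter += inter
        total_union += union
    return total_inter / total_union if total_union else 1.0

# Two toy samples: a small object segmented poorly, a large one perfectly.
preds = [[1, 0, 0, 0], [1, 1, 1, 1, 1, 1, 1, 1]]
targets = [[1, 1, 0, 0], [1, 1, 1, 1, 1, 1, 1, 1]]

print(mean_iou(preds, targets))        # (0.5 + 1.0) / 2 = 0.75
print(cumulative_iou(preds, targets))  # (1 + 8) / (2 + 8) = 0.9
```

The toy samples show why the two metrics can diverge: the poorly segmented small object pulls mIoU down to 0.75, while cIoU stays at 0.9 because the large object contributes most of the pixels.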