Extending CLIP's Image-Text Alignment to Referring Image Segmentation

About

Referring Image Segmentation (RIS) is a cross-modal task that aims to segment an instance described by a natural language expression. Recent methods leverage large-scale pretrained unimodal models as backbones along with fusion techniques for joint reasoning across modalities. However, the inherent cross-modal nature of RIS raises questions about the effectiveness of unimodal backbones. We propose RISCLIP, a novel framework that effectively leverages the cross-modal nature of CLIP for RIS. Observing CLIP's inherent alignment between image and text features, we capitalize on this starting point and introduce simple but strong modules that enhance unimodal feature extraction and leverage rich alignment knowledge in CLIP's image-text shared-embedding space. RISCLIP exhibits outstanding results on all three major RIS benchmarks and also outperforms previous CLIP-based methods, demonstrating the efficacy of our strategy in extending CLIP's image-text alignment to RIS.

Seoyeon Kim, Minguk Kang, Dongwon Kim, Jaesik Park, Suha Kwak• 2023

Related benchmarks

Task	Dataset	Result
Referring Image Segmentation	RefCOCO (val)	mIoU75.7	274
Referring Image Segmentation	RefCOCO+ (test-B)	mIoU60.7	267
Referring Image Segmentation	RefCOCO (test A)	mIoU78	245
Referring Image Segmentation	RefCOCO+ (val)	mIoU69.2	194
Referring Image Segmentation	RefCOCO (test-B)	mIoU72.5	186
Referring Image Segmentation	RefCOCOg (val)	--	114
Referring Image Segmentation	RefCOCO+ (testA)	mIoU73.5	112
Referring Image Segmentation	RefCOCOg (test)	--	75
Referring Expression Segmentation	RefCOCO UMD partition (test A)	--	34
Referring Image Segmentation	G-Ref UMD partition U (val)	oIoU64.1	24

Showing 10 of 18 rows

Other info

Follow for update

@wizwand_team Discord