Referring Image Segmentation Using Text Supervision

About

Existing Referring Image Segmentation (RIS) methods typically require expensive pixel-level or box-level annotations for supervision. In this paper, we observe that the referring texts used in RIS already provide sufficient information to localize the target object. Hence, we propose a novel weakly-supervised RIS framework that formulates target localization as a classification process that differentiates between positive and negative text expressions. The referring text expressions for an image serve as its positive expressions, while the referring text expressions from other images serve as its negative expressions. Our framework has three main novelties. First, we propose a bilateral prompt method that facilitates the classification process by harmonizing the domain discrepancy between visual and linguistic features. Second, we propose a calibration method that reduces noisy background information and improves the correctness of the response maps for target object localization. Third, we propose a positive response map selection strategy that generates high-quality pseudo-labels from the enhanced response maps, which are used to train a segmentation network for RIS inference. For evaluation, we propose a new metric to measure localization accuracy. Experiments on four benchmarks show that our framework achieves promising performance compared to existing fully-supervised RIS methods while outperforming state-of-the-art weakly-supervised methods adapted from related areas. Code is available at https://github.com/fawnliu/TRIS.
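The positive/negative expression setup described above can be illustrated with a minimal sketch. It is not the authors' implementation: the helper names (sample_expressions, binary_cross_entropy), the toy dataset, and the choice of a sigmoid binary cross-entropy loss are all assumptions made for illustration. The sketch only shows how expressions from the same image could be labeled positive, expressions from other images labeled negative, and how classification scores might then be penalized.

```python
import math
import random


def sample_expressions(dataset, image_id, num_negatives, rng):
    """Return (expression, label) pairs for one image (hypothetical helper).

    Expressions annotated on the image get label 1; expressions drawn
    from other images get label 0.
    """
    positives = [(text, 1) for text in dataset[image_id]]
    # Pool every expression that belongs to a different image.
    pool = [text for other_id, texts in dataset.items()
            if other_id != image_id for text in texts]
    chosen = rng.sample(pool, min(num_negatives, len(pool)))
    negatives = [(text, 0) for text in chosen]
    return positives + negatives


def binary_cross_entropy(scores, labels):
    """Mean binary cross-entropy over sigmoid-activated scores.

    The loss choice is an assumption; the source does not specify it.
    """
    total = 0.0
    for score, label in zip(scores, labels):
        prob = 1.0 / (1.0 + math.exp(-score))
        prob = min(max(prob, 1e-7), 1.0 - 1e-7)  # avoid log(0)
        total -= label * math.log(prob) + (1 - label) * math.log(1.0 - prob)
    return total / len(scores)


# Toy data (invented for illustration): image id -> referring expressions.
toy_dataset = {
    "img_a": ["the man on the left", "a red umbrella"],
    "img_b": ["the dog near the door"],
    "img_c": ["a white cup on the table", "the tallest building"],
}

pairs = sample_expressions(toy_dataset, "img_a", num_negatives=2,
                           rng=random.Random(0))
labels = [label for _, label in pairs]
```

In a real pipeline the scores passed to the loss would come from a model that compares image features with each expression; here they are left to the caller so the sketch stays self-contained.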

Fang Liu, Yuhao Liu, Yuqiu Kong, Ke Xu, Lihe Zhang, Baocai Yin, Gerhard Hancke, Rynson Lau • 2023

Related benchmarks

Task	Dataset	Result
Referring Image Segmentation	RefCOCO (val)	mIoU 41.1	283
Referring Image Segmentation	RefCOCO+ (test-B)	mIoU 30.8	276
Referring Image Segmentation	RefCOCO (test A)	mIoU 48.1	254
Referring Video Object Segmentation	Ref-DAVIS 2017 (val)	J&F 16.5	240
Referring Image Segmentation	RefCOCO+ (val)	mIoU 31.6	203
Referring Image Segmentation	RefCOCO (test-B)	mIoU 31.9	195
Referring Image Segmentation	RefCOCO+ (testA)	mIoU 31.9	121
Referring Video Object Segmentation	JHMDB Sentences (test)	Overall IoU 0.472	110
Referring Image Segmentation	G-Ref (val)	mIoU 36	95
Referring Image Segmentation	RefCOCO+ (test-A)	--	89

Showing 10 of 22 rows
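For readers unfamiliar with the mIoU figures in the table, the following is a minimal sketch of one common way to compute it: the intersection-over-union of each predicted mask against its ground-truth mask, averaged over samples. The function names and the convention of scoring two empty masks as 1.0 are assumptions, and the exact evaluation protocol behind the numbers above is not stated on this page.

```python
def mask_iou(pred, target):
    """IoU between two binary masks given as nested lists of 0/1."""
    intersection = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            intersection += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    # Assumed convention: two empty masks count as a perfect match.
    return intersection / union if union else 1.0


def mean_iou(preds, targets):
    """Average the per-sample IoU over all (prediction, target) pairs."""
    ious = [mask_iou(p, t) for p, t in zip(preds, targets)]
    return sum(ious) / len(ious)


# Toy masks (invented): the first pair overlaps on 1 of 2 foreground
# pixels, the second pair matches exactly.
preds = [[[1, 1], [0, 0]], [[0, 1], [0, 1]]]
targets = [[[1, 0], [0, 0]], [[0, 1], [0, 1]]]
score = mean_iou(preds, targets)
```

The result is a fraction between 0 and 1; multiplying by 100 gives a percentage-style figure.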
