
Referring Expression Object Segmentation with Caption-Aware Consistency

About

Referring expressions are natural language descriptions that identify a particular object within a scene and are widely used in our daily conversations. In this work, we focus on segmenting the object in an image specified by a referring expression. To this end, we propose an end-to-end trainable comprehension network that consists of language and visual encoders to extract feature representations from both domains. We introduce spatial-aware dynamic filters to transfer knowledge from text to image and to effectively capture the spatial information of the specified object. To better communicate between the language and visual modules, we employ a caption generation network that takes features shared across both domains as input and improves both representations via a consistency loss that encourages the generated sentence to match the given referring expression. We evaluate the proposed framework on two referring expression datasets and show that our method performs favorably against state-of-the-art algorithms.
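The core idea of text-conditioned, spatially aware filtering can be sketched in a few lines. The snippet below is a minimal NumPy illustration, not the authors' implementation: the function names (`spatial_coords`, `dynamic_filter_response`), the 1x1 filter form, and the projection matrix `W` are assumptions chosen for clarity. It shows how a sentence embedding can be turned into a per-pixel filter over visual features augmented with coordinate channels, yielding a soft segmentation mask.

```python
import numpy as np

def spatial_coords(h, w):
    """Normalized x/y coordinate channels appended to the visual features,
    so the dynamic filter can reason about object location."""
    ys, xs = np.meshgrid(np.linspace(-1, 1, h), np.linspace(-1, 1, w), indexing="ij")
    return np.stack([xs, ys])  # shape (2, h, w)

def dynamic_filter_response(visual, text_emb, W):
    """Apply a text-generated 1x1 filter to a spatially augmented feature map.

    visual:   (c, h, w) visual feature map
    text_emb: (d,) sentence embedding of the referring expression
    W:        (c + 2, d) learned projection from text to filter weights
    Returns a (h, w) map of per-pixel mask probabilities.
    """
    c, h, w = visual.shape
    feats = np.concatenate([visual, spatial_coords(h, w)], axis=0)  # (c+2, h, w)
    filt = W @ text_emb                                             # (c+2,) dynamic filter
    score = np.tensordot(filt, feats, axes=([0], [0]))              # (h, w) response map
    return 1.0 / (1.0 + np.exp(-score))                             # sigmoid -> probabilities
```

In a full model, `W` would be learned jointly with the encoders, multiple filters would be generated and fused, and the caption-consistency loss would supervise the shared features from the language side.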

Yi-Wen Chen, Yi-Hsuan Tsai, Tiantian Wang, Yen-Yu Lin, Ming-Hsuan Yang • 2019

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Referring Expression Segmentation | RefCOCO (testA) | – | – | 217 |
| Referring Image Segmentation | RefCOCO (val) | mIoU | 58.9 | 197 |
| Referring Expression Segmentation | RefCOCO (testB) | – | – | 191 |
| Referring Expression Segmentation | RefCOCO (val) | – | – | 190 |
| Referring Image Segmentation | RefCOCO (testA) | mIoU | 61.77 | 178 |
| Referring Image Segmentation | RefCOCO (testB) | mIoU | 53.81 | 119 |
| Referring Expression Segmentation | RefCOCOg (val) | – | – | 107 |
| Referring Image Segmentation | G-Ref (val) | mIoU | 46.37 | 95 |
| Referring Expression Segmentation | RefCOCOg (test) | – | – | 78 |
| Referring Image Segmentation | G-Ref Google split (val) | IoU | 44.32 | 58 |

(10 of 29 rows shown)
