Grounding of Textual Phrases in Images by Reconstruction

About

Grounding (i.e., localizing) arbitrary, free-form textual phrases in visual content is a challenging problem with many applications for human-computer interaction and image-text reference resolution. Few datasets provide the ground-truth spatial localization of phrases, so it is desirable to learn from data with little or no grounding supervision. We propose a novel approach which learns grounding by reconstructing a given phrase using an attention mechanism, which can be either latent or optimized directly. During training, our approach encodes the phrase using a recurrent network language model and then learns to attend to the relevant image region in order to reconstruct the input phrase. At test time, the correct attention, i.e., the grounding, is evaluated. If grounding supervision is available, it can be directly applied via a loss over the attention mechanism. We demonstrate the effectiveness of our approach on the Flickr 30k Entities and ReferItGame datasets with different levels of supervision, ranging from no supervision through partial supervision to full supervision. Our supervised variant improves by a large margin over the state-of-the-art on both datasets.
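The attention step described above can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the single-layer scoring form, the weight shapes, and the random inputs are all assumptions made for illustration. The phrase vector stands in for the output of the recurrent phrase encoder, and the decoder that reconstructs the phrase from the attended feature is omitted.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - np.max(x))
    return e / e.sum()


def attend(phrase_vec, region_feats, W_p, W_r, w_att):
    """Return one attention weight per candidate image region.

    phrase_vec:   (d_p,)    encoded phrase (assumed to come from an RNN encoder)
    region_feats: (n, d_r)  visual features of n candidate regions
    W_p, W_r, w_att: hypothetical learned parameters of the scoring layer
    """
    hidden = np.tanh(region_feats @ W_r + phrase_vec @ W_p)  # (n, d_h)
    scores = hidden @ w_att                                   # (n,)
    return softmax(scores)


def attended_feature(attention, region_feats):
    """Attention-weighted sum of region features, the input a decoder
    would use to reconstruct the phrase (decoder not shown)."""
    return attention @ region_feats


def ground(attention):
    """Test-time grounding: index of the region with the highest attention."""
    return int(np.argmax(attention))


def attention_loss(attention, gt_index):
    """Cross-entropy on the attention weights, usable only when the
    ground-truth region index is available (supervised case)."""
    return -float(np.log(attention[gt_index] + 1e-12))


# Toy usage with random parameters (illustrative values only).
rng = np.random.default_rng(0)
n, d_p, d_r, d_h = 5, 8, 16, 12
phrase_vec = rng.normal(size=d_p)
region_feats = rng.normal(size=(n, d_r))
W_p = rng.normal(size=(d_p, d_h))
W_r = rng.normal(size=(d_r, d_h))
w_att = rng.normal(size=d_h)

att = attend(phrase_vec, region_feats, W_p, W_r, w_att)
predicted_region = ground(att)
visual_input = attended_feature(att, region_feats)
```

In the unsupervised setting only a reconstruction loss on the decoder output would drive the attention; in the supervised setting a term such as `attention_loss` could be applied directly, and the partially supervised case would apply it only to the phrases that have region annotations.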

Anna Rohrbach, Marcus Rohrbach, Ronghang Hu, Trevor Darrell, Bernt Schiele • 2015

Related benchmarks

Task                                Dataset                    Result           Rank
Referring Expression Comprehension  RefCOCO (testA)            Accuracy 0.7103  333
Referring Expression Comprehension  RefCOCO+ (testB)           Accuracy 47.76   235
Referring Expression Segmentation   RefCOCO (testA)            --               217
Referring Expression Comprehension  RefCOCO+ (testA)           Accuracy 54.32   207
Referring Expression Comprehension  RefCOCO (testB)            Accuracy 65.77   196
Referring Expression Segmentation   RefCOCO (testB)            --               191
Referring Expression Segmentation   RefCOCO (val)              --               190
Referring Expression Segmentation   RefCOCOg (val)             --               107
Phrase Localization                 Flickr30K Entities (test)  Accuracy 48.38   35
Visual Grounding                    Flickr30K Entities (test)  Accuracy 47.81   29
Showing 10 of 20 rows
