
Neighbourhood Watch: Referring Expression Comprehension via Language-guided Graph Attention Networks

About

The task in referring expression comprehension is to localise the object instance in an image described by a referring expression phrased in natural language. As a language-to-vision matching task, the key to this problem is to learn a discriminative object feature that can adapt to the expression used. To avoid ambiguity, the expression normally describes not only the properties of the referent itself, but also its relationships to its neighbourhood. To capture and exploit this important information, we propose a graph-based, language-guided attention mechanism. Composed of a node attention component and an edge attention component, the proposed graph attention mechanism explicitly represents inter-object relationships and properties with a flexibility and power impossible with competing approaches. Furthermore, the proposed graph attention mechanism enables the comprehension decision to be visualisable and explainable. Experiments on three referring expression comprehension datasets show the advantage of the proposed approach.

Peng Wang, Qi Wu, Jiewei Cao, Chunhua Shen, Lianli Gao, Anton van den Hengel • 2018
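
The abstract names two components, node attention and edge attention, both guided by the expression. The sketch below is only an illustration of that general idea, not the authors' model: the function name `language_guided_graph_attention`, the dot-product scoring, the feature shapes, and the way neighbour context is added to each node are all assumptions made for this example.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def language_guided_graph_attention(node_feats, edge_feats, adj, lang_vec):
    """Hypothetical sketch of language-guided node and edge attention.

    node_feats: (N, D) one feature vector per candidate object
    edge_feats: (N, N, D) feature of the relation from object i to object j
    adj:        (N, N) bool, True where j is a neighbour of i
    lang_vec:   (D,) embedding of the referring expression
    """
    # Node attention: how well each object matches the expression.
    node_att = softmax(node_feats @ lang_vec)

    # Edge attention: for each object, a distribution over its neighbours,
    # based on how well each relation matches the expression.
    edge_scores = np.where(adj, edge_feats @ lang_vec, -1e9)
    edge_att = softmax(edge_scores, axis=1) * adj  # zero out non-neighbours

    # Assumed aggregation: add attended neighbour features to each node,
    # then score the result against the expression.
    context = edge_att @ node_feats
    match_scores = (node_feats + context) @ lang_vec
    return node_att, edge_att, match_scores


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, d = 4, 8
    nodes = rng.normal(size=(n, d))
    edges = rng.normal(size=(n, n, d))
    neighbours = ~np.eye(n, dtype=bool)  # every other object is a neighbour
    expression = rng.normal(size=d)
    node_att, edge_att, scores = language_guided_graph_attention(
        nodes, edges, neighbours, expression
    )
    print("predicted referent index:", int(scores.argmax()))
```

Because both attention maps are explicit arrays, they can be inspected or plotted per object and per neighbour, which is the kind of visualisable decision the abstract refers to.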

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Referring Expression Comprehension | RefCOCO (testA) | Accuracy | 0.766 | 342 |
| Referring Expression Comprehension | RefCOCO+ (testB) | Accuracy | 53.4 | 244 |
| Referring Expression Comprehension | RefCOCO+ (testA) | Accuracy | 64 | 216 |
| Referring Expression Comprehension | RefCOCO (testB) | Accuracy | 66.4 | 205 |
| Visual Grounding | RefCOCO+ (testB) | - | - | 180 |
| Visual Grounding | RefCOCO (testA) | - | - | 123 |
| Visual Grounding | RefCOCOg (val) | - | - | 114 |
| Visual Grounding | ReferCOCO v1 (testB) | Acc @ 0.5 | 66.4 | 30 |
| Visual Grounding | ReferCOCO+ v1 (testA) | Acc @ 0.5 | 64 | 24 |
| Visual Grounding | ReferCOCOg Google (val) | Accuracy @ 0.5 IoU | 61.78 | 16 |
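
Several rows above report accuracy at an IoU threshold of 0.5. As a minimal sketch of how such a metric is commonly computed, the code below counts a prediction as correct when its box overlaps the ground-truth box with intersection-over-union above the threshold. The `(x1, y1, x2, y2)` box format, the strict `>` comparison, and the function names `iou` and `accuracy_at_iou` are assumptions for illustration, not details taken from this page.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    # Clamp at zero so non-overlapping boxes give an empty intersection.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def accuracy_at_iou(pred_boxes, gt_boxes, threshold=0.5):
    """Fraction of predictions whose IoU with the ground truth exceeds threshold."""
    if len(pred_boxes) != len(gt_boxes):
        raise ValueError("predictions and ground truths must align one-to-one")
    if not pred_boxes:
        return 0.0
    hits = sum(iou(p, g) > threshold for p, g in zip(pred_boxes, gt_boxes))
    return hits / len(pred_boxes)


if __name__ == "__main__":
    preds = [(0, 0, 10, 10), (0, 0, 10, 10)]
    gts = [(0, 0, 10, 10), (20, 20, 30, 30)]
    print(accuracy_at_iou(preds, gts))  # one exact match, one miss: 0.5
```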
