
Improving Referring Expression Grounding with Cross-modal Attention-guided Erasing

About

Referring expression grounding aims to locate the object or person in an image described by a referring expression. The key challenge is to comprehend and align various types of information from the visual and textual domains, such as visual attributes, locations, and interactions with surrounding regions. Although attention mechanisms have been successfully applied to cross-modal alignment, previous attention models focus only on the most dominant features of both modalities and neglect the fact that there could be multiple comprehensive textual-visual correspondences between images and referring expressions. To tackle this issue, we design a novel cross-modal attention-guided erasing approach that discards the most dominant information from either the textual or the visual domain to generate hard training samples online, driving the model to discover complementary textual-visual correspondences. Extensive experiments demonstrate the effectiveness of the proposed method, which achieves state-of-the-art performance on three referring expression grounding datasets.

Xihui Liu, Zihao Wang, Jing Shao, Xiaogang Wang, Hongsheng Li • 2019
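At its core, the erasing step takes the cross-modal attention weights and removes the most-attended word or region, so the model must ground the expression from the remaining cues. The snippet below is a minimal PyTorch sketch of this idea, not the authors' implementation; the function name `attention_guided_erase` and the single-element erasing policy are assumptions for illustration.

```python
import torch

def attention_guided_erase(features, attention, erase_value=0.0):
    """Erase the most-attended element of each sample (hypothetical helper).

    features:  (batch, n, d) word or region features
    attention: (batch, n)    attention weights over the n elements
    Returns a copy of `features` with the dominant element zeroed out,
    usable as a harder training sample.
    """
    erased = features.clone()
    top = attention.argmax(dim=1)  # index of the most dominant element per sample
    erased[torch.arange(features.size(0)), top] = erase_value
    return erased

# Example: erase the dominant region from a batch of 4 images,
# each with 36 region features of dimension 512.
regions = torch.randn(4, 36, 512)
attn = torch.softmax(torch.randn(4, 36), dim=1)
hard_samples = attention_guided_erase(regions, attn)
```

In training, such erased samples would be fed through the grounding model alongside the originals, so the loss also rewards matching the expression against complementary, less dominant evidence.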

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Referring Expression Comprehension | RefCOCO+ (val) | Accuracy: 68.09 | 345 |
| Referring Expression Comprehension | RefCOCO (val) | Accuracy: 78.35 | 335 |
| Referring Expression Comprehension | RefCOCO (testA) | Accuracy: 83.14 | 333 |
| Referring Expression Comprehension | RefCOCO+ (testB) | Accuracy: 58.03 | 235 |
| Referring Expression Comprehension | RefCOCO+ (testA) | Accuracy: 73.65 | 207 |
| Referring Expression Comprehension | RefCOCO (testB) | Accuracy: 71.32 | 196 |
| Visual Grounding | RefCOCO+ (testB) | -- | 169 |
| Referring Expression Comprehension | RefCOCO (testB) | Accuracy: 71.32 | 160 |
| Visual Grounding | RefCOCO (testA) | -- | 117 |
| Visual Grounding | RefCOCOg (val) | -- | 93 |

Showing 10 of 18 rows.
