
AttnGrounder: Talking to Cars with Attention

About

We propose Attention Grounder (AttnGrounder), a single-stage, end-to-end trainable model for visual grounding. Visual grounding aims to localize a specific object in an image given a natural language text query. Unlike previous methods that use the same text representation for every image region, we use a visual-text attention module that relates each word in the query to every region in the corresponding image, yielding a region-dependent text representation. To further improve the localization ability of our model, we use the same visual-text attention module to generate an attention mask around the referred object. The attention mask is trained as an auxiliary task against a rectangular mask derived from the ground-truth bounding-box coordinates. We evaluate AttnGrounder on the Talk2Car dataset and show an improvement of 3.26% over existing methods.
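The core idea above — image regions attending over query words to build a region-dependent text representation, plus an auxiliary per-region mask score — can be sketched as follows. This is a minimal illustration, not the paper's exact implementation; all layer names, dimensions, and the single-head dot-product formulation are assumptions.

```python
import torch
import torch.nn as nn

class VisualTextAttention(nn.Module):
    """Hypothetical sketch of a visual-text attention module:
    each image region attends over the words of the text query."""

    def __init__(self, d_text, d_vis, d_model):
        super().__init__()
        self.q = nn.Linear(d_vis, d_model)      # image regions act as queries
        self.k = nn.Linear(d_text, d_model)     # query words act as keys
        self.v = nn.Linear(d_text, d_model)     # ...and as values
        self.mask_head = nn.Linear(d_model, 1)  # auxiliary per-region mask score

    def forward(self, vis, txt):
        # vis: (B, R, d_vis) region features; txt: (B, T, d_text) word features
        Q, K, V = self.q(vis), self.k(txt), self.v(txt)
        # attn[b, r, t]: relevance of word t to region r
        attn = torch.softmax(Q @ K.transpose(-1, -2) / Q.size(-1) ** 0.5, dim=-1)
        region_text = attn @ V  # (B, R, d_model): region-dependent text representation
        # per-region logit; could be trained (e.g. with BCE) against the
        # rectangular ground-truth mask described above
        mask_logits = self.mask_head(region_text).squeeze(-1)  # (B, R)
        return region_text, mask_logits, attn
```

Because the text representation is recomputed per region, regions that match the query attend to its discriminative words, while the mask head gives the auxiliary localization signal described in the abstract.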

Vivek Mittal • 2020

Related benchmarks

| Task | Dataset | Result (IoU) | Rank |
| --- | --- | --- | --- |
| Visual Grounding | Corner-case Ambiguous | 64.31 | 15 |
| Visual Grounding | Talk2Car | 61.32 | 15 |
| Visual Grounding | MoCAD (test) | 0.6234 | 15 |
| Visual Grounding | MoCAD (val) | 64.35 | 15 |
| Visual Grounding | DrivePilot (test) | 62.31 | 15 |
| Visual Grounding | DrivePilot (val) | 64.57 | 15 |
| Visual Grounding | Corner-case Visual Constr. | 62.74 | 15 |
| Visual Grounding | Corner-case Multi-agent | 64.82 | 15 |
| Visual Grounding | Long-text (val) | 57.25 | 15 |
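The benchmarks above report intersection-over-union (IoU) between the predicted and ground-truth bounding boxes. For reference, a minimal sketch of the standard box-IoU computation (boxes given as `(x1, y1, x2, y2)` corners — an assumed convention, not tied to any particular benchmark's code):

```python
def box_iou(a, b):
    """Intersection-over-union of two axis-aligned boxes (x1, y1, x2, y2)."""
    # overlap rectangle; width/height are clamped at 0 when boxes don't intersect
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)
```

Leaderboards often threshold this value (e.g. counting a prediction as correct when IoU exceeds 0.5) rather than averaging raw IoU.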
