Cosine meets Softmax: A tough-to-beat baseline for visual grounding
About
In this paper, we present a simple baseline for visual grounding in autonomous driving that outperforms state-of-the-art methods while retaining minimal design choices. Our framework minimizes a cross-entropy loss over the cosine distances between multiple image ROI features and a text embedding (representing the given sentence/phrase). We use pre-trained networks to obtain the initial embeddings and learn a transformation layer on top of the text embedding. We perform experiments on the Talk2Car dataset and achieve 68.7% AP50 accuracy, improving upon the previous state of the art by 8.6%. By showing the promise of this simpler alternative, our investigation suggests reconsidering approaches that rely on sophisticated attention mechanisms, multi-stage reasoning, or complex metric-learning loss functions.
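The scoring scheme described above can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation: the function names, the single linear transformation matrix `W`, and the temperature parameter are assumptions for the example; in practice the ROI and text features would come from the pre-trained networks mentioned above.

```python
import numpy as np

def cosine_softmax_probs(roi_feats, text_emb, W, temperature=0.1):
    """Score each ROI by cosine similarity to the transformed text
    embedding, then apply a softmax over ROIs.

    roi_feats: (N, D) array of ROI features.
    text_emb:  (D_t,) text embedding.
    W:         (D_t, D) learned transformation (an assumption of this sketch).
    """
    t = text_emb @ W                           # transform text into the ROI feature space
    t = t / np.linalg.norm(t)                  # unit-normalize the text vector
    r = roi_feats / np.linalg.norm(roi_feats, axis=1, keepdims=True)
    logits = (r @ t) / temperature             # scaled cosine similarities
    exp = np.exp(logits - logits.max())        # numerically stable softmax
    return exp / exp.sum()

def cross_entropy_loss(probs, target_idx):
    """Cross-entropy over ROI probabilities for the ground-truth ROI."""
    return -np.log(probs[target_idx])

# Toy usage with random features.
rng = np.random.default_rng(0)
rois = rng.normal(size=(8, 256))               # 8 candidate ROIs
text = rng.normal(size=(128,))                 # text embedding
W = rng.normal(size=(128, 256)) * 0.01         # stand-in for the learned layer
probs = cosine_softmax_probs(rois, text, W)
loss = cross_entropy_loss(probs, target_idx=3)
```

At training time, only `W` is optimized by minimizing the loss; at inference, the ROI with the highest probability is returned as the grounded region.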
Related benchmarks
| Task | Dataset | Result (IoU) | Rank |
|---|---|---|---|
| Visual Grounding | Corner-case Visual Constr. | 69.39 | 15 |
| Visual Grounding | Corner-case Multi-agent | 66.77 | 15 |
| Visual Grounding | Talk2Car | 68.61 | 15 |
| Visual Grounding | Corner-case Ambiguous | 67.83 | 15 |
| Visual Grounding | MoCAD (test) | 0.6766 | 15 |
| Visual Grounding | DrivePilot (test) | 68.87 | 15 |
| Visual Grounding | DrivePilot (val) | 69.93 | 15 |
| Visual Grounding | MoCAD (val) | 68.47 | 15 |
| Visual Grounding | Long-text (val) | 62.21 | 15 |