Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering

About

Top-down visual attention mechanisms have been used extensively in image captioning and visual question answering (VQA) to enable deeper image understanding through fine-grained analysis and even multiple steps of reasoning. In this work, we propose a combined bottom-up and top-down attention mechanism that enables attention to be calculated at the level of objects and other salient image regions. This is the natural basis for attention to be considered. Within our approach, the bottom-up mechanism (based on Faster R-CNN) proposes image regions, each with an associated feature vector, while the top-down mechanism determines feature weightings. Applying this approach to image captioning, our results on the MSCOCO test server establish a new state-of-the-art for the task, achieving CIDEr / SPICE / BLEU-4 scores of 117.9, 21.5 and 36.9, respectively. Demonstrating the broad applicability of the method, applying the same approach to VQA we obtain first place in the 2017 VQA Challenge.

Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, Lei Zhang • 2017
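To make the division of labour described in the abstract concrete, here is a minimal sketch of the top-down half: given a set of region feature vectors (which the bottom-up detector would supply), it computes one weight per region and returns the weighted average. This is an illustration under stated assumptions, not the authors' implementation. The additive scoring function, the parameter names `W_v`, `W_q`, `w_a`, the toy dimensions, and the random features standing in for Faster R-CNN outputs are all hypothetical.

```python
import numpy as np


def top_down_attention(regions, query, W_v, W_q, w_a):
    """Weight region features by their relevance to a query vector.

    regions: (k, d_v) array, one feature vector per proposed image region
    query:   (d_q,) array, task context (e.g. an encoded question)
    W_v, W_q, w_a: hypothetical learned parameters of shapes
                   (d_v, d_h), (d_q, d_h) and (d_h,)
    Returns (weights, attended): weights has shape (k,) and sums to 1;
    attended is the (d_v,) weighted average of the region features.
    """
    # Project regions and query into a shared space (assumed additive form).
    hidden = np.tanh(regions @ W_v + query @ W_q)  # (k, d_h)
    scores = hidden @ w_a                          # (k,) one score per region

    # Numerically stable softmax over the k regions.
    scores = scores - scores.max()
    exp_scores = np.exp(scores)
    weights = exp_scores / exp_scores.sum()

    # Attended feature: convex combination of the region features.
    attended = weights @ regions                   # (d_v,)
    return weights, attended


# Toy stand-ins: random vectors in place of detector features.
rng = np.random.default_rng(0)
k, d_v, d_q, d_h = 5, 8, 4, 6
regions = rng.standard_normal((k, d_v))
query = rng.standard_normal(d_q)
W_v = rng.standard_normal((d_v, d_h))
W_q = rng.standard_normal((d_q, d_h))
w_a = rng.standard_normal(d_h)

weights, attended = top_down_attention(regions, query, W_v, W_q, w_a)
```

Because the weights are a softmax over regions rather than over a uniform grid of pixels, each weight attaches to an object or salient region, which is the point the abstract makes about attention at the level of objects.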

Related benchmarks

Task	Dataset	Metric	Result
Visual Question Answering	VizWiz	Accuracy	54.28	1863
Visual Question Answering	VQA v2 (test-dev)	Overall Accuracy	65.32	721
Image Captioning	MS COCO Karpathy (test)	CIDEr	1.319	706
Visual Question Answering	VQA v2 (test-std)	Accuracy	72.91	486
Visual Question Answering	VQA 2.0 (test-dev)	Accuracy	72.7	337
Science Question Answering	ScienceQA (test)	Average Accuracy	59.02	273
Radiology Report Generation	MIMIC-CXR (test)	BLEU-4	0.092	235
Visual Question Answering	GQA (test)	Accuracy	49.7	204
Visual Entailment	SNLI-VE (test)	Overall Accuracy	70.3	199
Visual Question Answering	VQA 2.0 (val)	Accuracy (Overall)	63.2	183

Showing 10 of 73 rows
