Stacked Attention Networks for Image Question Answering

About

This paper presents stacked attention networks (SANs) that learn to answer natural language questions from images. SANs use the semantic representation of a question as a query to search for the regions in an image that are related to the answer. We argue that image question answering (QA) often requires multiple steps of reasoning. Thus, we develop a multiple-layer SAN in which we query an image multiple times to infer the answer progressively. Experiments conducted on four image QA data sets demonstrate that the proposed SANs significantly outperform previous state-of-the-art approaches. Visualization of the attention layers illustrates how the SAN locates, layer by layer, the relevant visual clues that lead to the answer to the question.

Zichao Yang, Xiaodong He, Jianfeng Gao, Li Deng, Alex Smola • 2015
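
The multi-step querying described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes an additive (tanh) attention score, a refined query formed by adding the attended image vector to the previous query, and randomly initialized weights. The names `attention_layer` and `stacked_attention`, and the dimensions `m`, `d`, and `k`, are hypothetical and chosen only for illustration.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D score vector.
    e = np.exp(x - np.max(x))
    return e / e.sum()

def attention_layer(regions, query, W_i, W_q, b, w_p):
    """One attention hop: score every image region against the query.

    regions: (m, d) region feature vectors; query: (d,) question/query vector.
    Returns the refined query (d,) and the attention weights (m,).
    """
    # Assumed additive scoring: project regions and query into a shared space.
    h = np.tanh(regions @ W_i + query @ W_q + b)   # (m, k)
    p = softmax(h @ w_p)                           # (m,) weights over regions
    attended = p @ regions                         # (d,) weighted sum of regions
    # Assumed refinement: combine the attended image vector with the query.
    return attended + query, p

def stacked_attention(regions, question, params):
    """Query the image once per layer, refining the query each time."""
    query = question
    weights = []
    for W_i, W_q, b, w_p in params:
        query, p = attention_layer(regions, query, W_i, W_q, b, w_p)
        weights.append(p)
    return query, weights

# Toy setup (all sizes hypothetical): m regions, d-dim features, k-dim hidden.
rng = np.random.default_rng(0)
m, d, k = 196, 8, 16
regions = rng.normal(size=(m, d))
question = rng.normal(size=d)
params = [
    (rng.normal(size=(d, k)), rng.normal(size=(d, k)), np.zeros(k), rng.normal(size=k))
    for _ in range(2)  # two attention layers
]
answer_query, attn = stacked_attention(regions, question, params)
```

In a full model, `answer_query` would feed a classifier over candidate answers, and the per-layer entries of `attn` are the kind of weights one would visualize to see which regions each layer attends to.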

Related benchmarks

Task	Dataset	Metric	Result	(unlabeled)
Visual Question Answering	VQA v2 (test-dev)	Overall Accuracy	63	712
Visual Question Answering	VQA 2.0 (val)	Accuracy (Overall)	61.7	183
Visual Question Answering	VQA v2 (val)	Accuracy	55.61	158
Visual Question Answering	VQA (test-dev)	Acc (All)	58.7	147
Visual Dialog	VisDial v0.9 (val)	MRR	57.64	141
Visual Question Answering	VQA-CP v2 (test)	Overall Accuracy	24.96	128
Visual Question Answering	VQA (test-std)	--	--	120
Open-Ended Visual Question Answering	VQA 1.0 (test-dev)	Overall Accuracy	58.7	100
Medical Visual Question Answering	SLAKE (test)	Closed Accuracy	79.1	67
Visual Question Answering	CLEVR (test)	Overall Accuracy	68.5	61

Showing 10 of 44 rows
