Dynamic Fusion with Intra- and Inter- Modality Attention Flow for Visual Question Answering
About
Learning an effective fusion of multi-modality features is at the heart of visual question answering. We propose a novel method for dynamically fusing multi-modal features with intra- and inter-modality information flow, which alternately passes dynamic information between and across the visual and language modalities. It robustly captures high-level interactions between the language and vision domains, thus significantly improving visual question answering performance. We also show that the proposed dynamic intra-modality attention flow, conditioned on the other modality, can dynamically modulate the intra-modality attention of the target modality, which is vital for multi-modality feature fusion. Experimental evaluations on the VQA 2.0 dataset show that the proposed method achieves state-of-the-art VQA performance. Extensive ablation studies provide a comprehensive analysis of the proposed method.
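The alternating intra- and inter-modality attention flow can be illustrated with a minimal NumPy sketch. This is not the paper's implementation (DFAF uses learned multi-head projections and stacked blocks); the function names and the sigmoid gating used to condition intra-modality attention on the other modality are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # scaled dot-product attention (projections omitted for brevity)
    d = q.shape[-1]
    return softmax(q @ k.T / np.sqrt(d)) @ v

def inter_modality_flow(vis, lang):
    # each modality attends to the other, passing information across domains
    vis_out = vis + attention(vis, lang, lang)
    lang_out = lang + attention(lang, vis, vis)
    return vis_out, lang_out

def intra_modality_flow(vis, lang):
    # self-attention within each modality, here conditioned on the other
    # modality via a sigmoid gate over its pooled summary (illustrative)
    gate_v = 1.0 / (1.0 + np.exp(-lang.mean(axis=0)))  # language conditions vision
    gate_l = 1.0 / (1.0 + np.exp(-vis.mean(axis=0)))   # vision conditions language
    vis_out = vis + attention(vis * gate_v, vis * gate_v, vis)
    lang_out = lang + attention(lang * gate_l, lang * gate_l, lang)
    return vis_out, lang_out

rng = np.random.default_rng(0)
vis = rng.standard_normal((36, 64))   # e.g. 36 image-region features
lang = rng.standard_normal((14, 64))  # e.g. 14 word features
vis, lang = inter_modality_flow(vis, lang)   # pass information across modalities
vis, lang = intra_modality_flow(vis, lang)   # modulated self-attention within each
print(vis.shape, lang.shape)
```

In the full model these two flows alternate over several blocks, so that the intra-modality attention at each stage is repeatedly re-conditioned on the latest state of the other modality.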
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Visual Question Answering | VQA v2 (test-dev) | Overall Accuracy | 70.59 | 664 |
| Visual Question Answering | VQA v2 (test-std) | Accuracy | 70.34 | 466 |
| Visual Question Answering | VQA 2.0 (test-dev) | Accuracy | 70.22 | 337 |
| Science Question Answering | ScienceQA (test) | Average Accuracy | 60.72 | 208 |
| Visual Question Answering | VQA 2.0 (val) | Accuracy (Overall) | 66.2 | 143 |
| Science Question Answering | ScienceQA | IMG Score | 0.5449 | 49 |
| Multimodal Science Question Answering | ScienceQA v1.0 (test) | Accuracy (Natural Language Component) | 64.03 | 31 |
| Icon Question Answering | IconQA (test) | Accuracy (Img) | 77.72 | 13 |
| pMCI classification | ADNI pMCI | Balanced Accuracy | 77.14 | 8 |
| Visual Question Answering | VQA loc v2.0 (val) | Accuracy | 66.21 | 7 |