
Dynamic Fusion with Intra- and Inter- Modality Attention Flow for Visual Question Answering

About

Learning effective fusion of multi-modality features is at the heart of visual question answering. We propose a novel method for dynamically fusing multi-modal features with intra- and inter-modality information flow, which alternately passes dynamic information between and across the visual and language modalities. It robustly captures high-level interactions between the language and vision domains, thus significantly improving visual question answering performance. We also show that the proposed dynamic intra-modality attention flow, conditioned on the other modality, can dynamically modulate the intra-modality attention of the target modality, which is vital for multi-modality feature fusion. Experimental evaluations on the VQA 2.0 dataset show that the proposed method achieves state-of-the-art VQA performance. Extensive ablation studies are carried out for a comprehensive analysis of the proposed method.
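The two information flows described above can be sketched with plain scaled dot-product attention: an inter-modality step where each modality queries the other, followed by an intra-modality self-attention step on the updated features. This is a minimal NumPy illustration, not the paper's implementation; the feature sizes are hypothetical, and the paper's full method additionally conditions the intra-modality attention on the other modality and uses learned projections.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # scaled dot-product attention: softmax(QK^T / sqrt(d)) V
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v

# toy features: 4 visual regions, 6 question words, 8-dim embeddings (hypothetical sizes)
rng = np.random.default_rng(0)
vis = rng.standard_normal((4, 8))
lang = rng.standard_normal((6, 8))

# inter-modality flow: each modality attends to the other
vis_inter = attention(vis, lang, lang)    # vision queries language
lang_inter = attention(lang, vis, vis)    # language queries vision

# intra-modality flow: self-attention within each (updated) modality;
# the paper further modulates this step using the other modality
vis_intra = attention(vis_inter, vis_inter, vis_inter)
lang_intra = attention(lang_inter, lang_inter, lang_inter)

print(vis_intra.shape, lang_intra.shape)  # shapes are preserved per modality
```

In the full model these blocks are stacked and alternated, so information repeatedly flows across and within the two modalities before the fused features are used for answer prediction.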

Gao Peng, Zhengkai Jiang, Haoxuan You, Pan Lu, Steven Hoi, Xiaogang Wang, Hongsheng Li • 2018

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Visual Question Answering | VQA v2 (test-dev) | Overall Accuracy | 70.59 | 664 |
| Visual Question Answering | VQA v2 (test-std) | Accuracy | 70.34 | 466 |
| Visual Question Answering | VQA 2.0 (test-dev) | Accuracy | 70.22 | 337 |
| Science Question Answering | ScienceQA (test) | Average Accuracy | 60.72 | 208 |
| Visual Question Answering | VQA 2.0 (val) | Accuracy (Overall) | 66.2 | 143 |
| Science Question Answering | ScienceQA | IMG Score | 0.5449 | 49 |
| Multimodal Science Question Answering | ScienceQA v1.0 (test) | Accuracy (Natural Language Component) | 64.03 | 31 |
| Icon Question Answering | IconQA (test) | Accuracy (Img) | 77.72 | 13 |
| pMCI classification | ADNI pMCI | Balanced Accuracy | 77.14 | 8 |
| Visual Question Answering | VQA loc v2.0 (val) | Accuracy | 66.21 | 7 |
