Dynamic Fusion with Intra- and Inter- Modality Attention Flow for Visual Question Answering
About
Learning an effective fusion of multi-modality features is at the heart of visual question answering. We propose a novel method for dynamically fusing multi-modal features with intra- and inter-modality information flow, which alternately passes dynamic information between and across the visual and language modalities. It robustly captures high-level interactions between the language and vision domains, thus significantly improving visual question answering performance. We also show that the proposed dynamic intra-modality attention flow, conditioned on the other modality, can dynamically modulate the intra-modality attention of the target modality, which is vital for multi-modality feature fusion. Experimental evaluations on the VQA 2.0 dataset show that the proposed method achieves state-of-the-art VQA performance. Extensive ablation studies provide a comprehensive analysis of the proposed method.
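The alternating intra- and inter-modality attention flow can be illustrated with a minimal NumPy sketch. This is not the paper's implementation (DFAF uses learned multi-head projections and stacked blocks); the function names and the sigmoid gating used to condition intra-modality attention on the other modality are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # scaled dot-product attention (projections omitted for brevity)
    d = q.shape[-1]
    return softmax(q @ k.T / np.sqrt(d)) @ v

def inter_modality_flow(vis, lang):
    # each modality attends to the other, passing information across domains
    vis_out = vis + attention(vis, lang, lang)
    lang_out = lang + attention(lang, vis, vis)
    return vis_out, lang_out

def intra_modality_flow(vis, lang):
    # self-attention within each modality, here conditioned on the other
    # modality via a sigmoid gate over its pooled summary (illustrative)
    gate_v = 1.0 / (1.0 + np.exp(-lang.mean(axis=0)))  # language conditions vision
    gate_l = 1.0 / (1.0 + np.exp(-vis.mean(axis=0)))   # vision conditions language
    vis_out = vis + attention(vis * gate_v, vis * gate_v, vis)
    lang_out = lang + attention(lang * gate_l, lang * gate_l, lang)
    return vis_out, lang_out

rng = np.random.default_rng(0)
vis = rng.standard_normal((36, 64))   # e.g. 36 image-region features
lang = rng.standard_normal((14, 64))  # e.g. 14 word features
vis, lang = inter_modality_flow(vis, lang)   # pass information across modalities
vis, lang = intra_modality_flow(vis, lang)   # modulated self-attention within each
print(vis.shape, lang.shape)
```

In the full model these two flows alternate over several blocks, so that the intra-modality attention at each stage is repeatedly re-conditioned on the latest state of the other modality.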
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Visual Question Answering | VQA v2 (test-dev) | Overall Accuracy | 70.59 | 664 |
| Visual Question Answering | VQA v2 (test-std) | Accuracy | 70.34 | 466 |
| Visual Question Answering | VQA 2.0 (test-dev) | Accuracy | 70.22 | 337 |
| Science Question Answering | ScienceQA (test) | Average Accuracy | 60.72 | 208 |
| Visual Question Answering | VQA 2.0 (val) | Accuracy (Overall) | 66.2 | 143 |
| Science Question Answering | ScienceQA | IMG Score | 0.5449 | 49 |
| Multimodal Science Question Answering | ScienceQA v1.0 (test) | Accuracy (Natural Language Component) | 64.03 | 31 |
| Icon Question Answering | IconQA (test) | Accuracy (Img) | 77.72 | 13 |
| pMCI classification | ADNI pMCI | Balanced Accuracy | 77.14 | 8 |
| Visual Question Answering | VQA loc v2.0 (val) | Accuracy | 66.21 | 7 |