Deep Modular Co-Attention Networks for Visual Question Answering
About
Visual Question Answering (VQA) requires a fine-grained and simultaneous understanding of both the visual content of images and the textual content of questions. Therefore, designing an effective "co-attention" model to associate key words in questions with key objects in images is central to VQA performance. So far, most successful attempts at co-attention learning have been achieved by using shallow models, and deep co-attention models show little improvement over their shallow counterparts. In this paper, we propose a deep Modular Co-Attention Network (MCAN) that consists of Modular Co-Attention (MCA) layers cascaded in depth. Each MCA layer models the self-attention of questions and images, as well as the guided-attention of images jointly using a modular composition of two basic attention units. We quantitatively and qualitatively evaluate MCAN on the benchmark VQA-v2 dataset and conduct extensive ablation studies to explore the reasons behind MCAN's effectiveness. Experimental results demonstrate that MCAN significantly outperforms the previous state-of-the-art. Our best single model delivers 70.63% overall accuracy on the test-dev set. Code is available at https://github.com/MILVLG/mcan-vqa.
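The layer structure described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation (see the linked repository for that). It assumes that both attention units are single-head scaled dot-product attention with no learned projections, that each unit is wrapped in a residual connection, and that feed-forward sublayers and normalization are omitted. All function names (`attention`, `self_attention`, `guided_attention`, `mca_layer`, `mcan_sketch`) and the feature shapes are hypothetical and chosen for illustration only.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention(queries, keys, values):
    # Scaled dot-product attention (assumed form of the basic unit).
    d = queries.shape[-1]
    weights = softmax(queries @ keys.T / np.sqrt(d), axis=-1)
    return weights @ values, weights


def self_attention(x):
    # Self-attention unit: the features attend to themselves.
    return attention(x, x, x)[0]


def guided_attention(x, guide):
    # Guided-attention unit: x supplies the queries, guide supplies keys and values.
    return attention(x, guide, guide)[0]


def mca_layer(question, image):
    # One MCA layer as described in the abstract: self-attention on the
    # question, self-attention on the image, then question-guided attention
    # on the image. The residual additions are an assumption of this sketch.
    question = question + self_attention(question)
    image = image + self_attention(image)
    image = image + guided_attention(image, question)
    return question, image


def mcan_sketch(question, image, num_layers=2):
    # Cascade MCA layers in depth.
    for _ in range(num_layers):
        question, image = mca_layer(question, image)
    return question, image


rng = np.random.default_rng(0)
question = rng.normal(size=(5, 8))   # 5 word features of dimension 8 (made-up sizes)
image = rng.normal(size=(10, 8))     # 10 region features of dimension 8 (made-up sizes)
q_out, v_out = mcan_sketch(question, image, num_layers=2)
print(q_out.shape, v_out.shape)
```

Because guided attention takes its queries from the image features and its keys and values from the question features, its output keeps one row per image region, so each layer preserves both feature shapes and layers can be stacked to any depth.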
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Visual Question Answering | GQA | Accuracy | 57.4 | 963 |
| Visual Question Answering | VQA v2 (test-dev) | Overall Accuracy | 75 | 664 |
| Visual Question Answering | VQA v2 (test-std) | Accuracy | 70.9 | 466 |
| Visual Question Answering | VQA 2.0 (test-dev) | Accuracy | 70.63 | 337 |
| Visual Question Answering | OK-VQA (test) | Accuracy | 44.65 | 296 |
| Science Question Answering | ScienceQA (test) | Average Accuracy | 54.54 | 208 |
| Visual Question Answering | GQA (test-dev) | Accuracy | 57.4 | 178 |
| Visual Question Answering | VQA 2.0 (val) | Accuracy (Overall) | 67.23 | 143 |
| 3D Question Answering | ScanQA (val) | CIDEr | 64.9 | 133 |
| Audio-Visual Question Answering | MUSIC-AVQA 1.0 (test) | AV Localis Accuracy | 71.18 | 96 |