Deep Modular Co-Attention Networks for Visual Question Answering

About

Visual Question Answering (VQA) requires a fine-grained and simultaneous understanding of both the visual content of images and the textual content of questions. Therefore, designing an effective 'co-attention' model to associate key words in questions with key objects in images is central to VQA performance. So far, most successful attempts at co-attention learning have been achieved by using shallow models, and deep co-attention models show little improvement over their shallow counterparts. In this paper, we propose a deep Modular Co-Attention Network (MCAN) that consists of Modular Co-Attention (MCA) layers cascaded in depth. Each MCA layer models the self-attention of questions and images, as well as the guided-attention of images, jointly using a modular composition of two basic attention units. We quantitatively and qualitatively evaluate MCAN on the benchmark VQA-v2 dataset and conduct extensive ablation studies to explore the reasons behind MCAN's effectiveness. Experimental results demonstrate that MCAN significantly outperforms the previous state-of-the-art. Our best single model delivers 70.63% overall accuracy on the test-dev set. Code is available at https://github.com/MILVLG/mcan-vqa.
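
The modular composition described above can be illustrated with a minimal sketch. It is not the authors' implementation (see the linked repository for that). It assumes plain scaled dot-product attention with residual connections, and it omits the learned projections, multiple heads, feed-forward sublayers, and normalization that a trained model would need. All function names are hypothetical, and the choice to let question features guide the image features is an assumption about how the two units are wired.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention on lists of feature vectors."""
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out


def add(x, y):
    """Element-wise sum of two equally shaped feature lists (residual)."""
    return [[a + b for a, b in zip(r1, r2)] for r1, r2 in zip(x, y)]


def self_attention_unit(x):
    # Unit 1: features attend to themselves.
    return add(x, attention(x, x, x))


def guided_attention_unit(x, guide):
    # Unit 2: features in x query the guiding features.
    return add(x, attention(x, guide, guide))


def mca_layer(question, image):
    # Assumed wiring: self-attention on both inputs, then
    # question-guided attention on the image features.
    question = self_attention_unit(question)
    image = self_attention_unit(image)
    image = guided_attention_unit(image, question)
    return question, image


def mcan_sketch(question, image, depth=2):
    # Cascade the layers in depth.
    for _ in range(depth):
        question, image = mca_layer(question, image)
    return question, image


if __name__ == "__main__":
    words = [[1.0, 0.0], [0.0, 1.0]]                  # 2 toy word features
    regions = [[0.5, 0.5], [1.0, -1.0], [0.0, 2.0]]   # 3 toy region features
    q_out, img_out = mcan_sketch(words, regions, depth=2)
    print(len(q_out), len(img_out))
```

Each unit preserves the number and size of its input features, which is what allows the layers to be cascaded to an arbitrary depth.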

Zhou Yu, Jun Yu, Yuhao Cui, Dacheng Tao, Qi Tian • 2019

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Visual Question Answering | GQA | Accuracy: 57.4 | 963 |
| Visual Question Answering | VQA v2 (test-dev) | Overall Accuracy: 75 | 664 |
| Visual Question Answering | VQA v2 (test-std) | Accuracy: 70.9 | 466 |
| Visual Question Answering | VQA 2.0 (test-dev) | Accuracy: 70.63 | 337 |
| Visual Question Answering | OK-VQA (test) | Accuracy: 44.65 | 296 |
| Science Question Answering | ScienceQA (test) | Average Accuracy: 54.54 | 208 |
| Visual Question Answering | GQA (test-dev) | Accuracy: 57.4 | 178 |
| Visual Question Answering | VQA 2.0 (val) | Accuracy (Overall): 67.23 | 143 |
| 3D Question Answering | ScanQA (val) | CIDEr: 64.9 | 133 |
| Audio-Visual Question Answering | MUSIC-AVQA 1.0 (test) | AV Localis Accuracy: 71.18 | 96 |
Showing 10 of 42 rows
