
Multi-modal Factorized Bilinear Pooling with Co-Attention Learning for Visual Question Answering

About

Visual question answering (VQA) is challenging because it requires a simultaneous understanding of both the visual content of images and the textual content of questions. The approaches used to represent the images and questions in a fine-grained manner and to fuse these multi-modal features play key roles in performance. Bilinear pooling based models have been shown to outperform traditional linear models for VQA, but their high-dimensional representations and high computational complexity may seriously limit their applicability in practice. For multi-modal feature fusion, here we develop a Multi-modal Factorized Bilinear (MFB) pooling approach to efficiently and effectively combine multi-modal features, which results in superior performance for VQA compared with other bilinear pooling approaches. For fine-grained image and question representation, we develop a co-attention mechanism using an end-to-end deep network architecture to jointly learn both the image and question attentions. Combining the proposed MFB approach with co-attention learning in a new network architecture provides a unified model for VQA. Our experimental results demonstrate that the single MFB with co-attention model achieves new state-of-the-art performance on the real-world VQA dataset. Code available at https://github.com/yuzcccc/mfb.

Zhou Yu, Jun Yu, Jianping Fan, Dacheng Tao • 2017
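
The abstract names MFB pooling without spelling out the computation. As a rough illustration, the NumPy sketch below follows the commonly cited formulation: project both feature vectors into a shared space, multiply element-wise, sum-pool over windows of size k, then apply power and L2 normalization. The function name `mfb_pool`, the dimensions, and the random weights are hypothetical stand-ins for learned parameters; dropout and the co-attention module are left out, so this should not be read as the authors' implementation.

```python
import numpy as np

def mfb_pool(x, y, U, V, k, normalize=True):
    """Fuse feature vectors x and y by factorized bilinear pooling (sketch).

    x: (m,) image feature, y: (n,) question feature
    U: (m, k*o), V: (n, k*o) projection matrices; k is the factor count
    Returns a fused vector of shape (o,).
    """
    # Project both modalities to k*o dimensions and multiply element-wise.
    joint = (x @ U) * (y @ V)
    # Sum-pool over non-overlapping windows of k consecutive entries.
    z = joint.reshape(-1, k).sum(axis=1)
    if not normalize:
        return z
    # Power normalization followed by L2 normalization.
    z = np.sign(z) * np.sqrt(np.abs(z))
    return z / (np.linalg.norm(z) + 1e-12)

# Hypothetical toy dimensions; real models use far larger values.
m, n, k, o = 8, 6, 5, 4
rng = np.random.default_rng(0)
x = rng.standard_normal(m)           # stand-in for an image feature
y = rng.standard_normal(n)           # stand-in for a question feature
U = rng.standard_normal((m, k * o))  # random, not learned
V = rng.standard_normal((n, k * o))

z = mfb_pool(x, y, U, V, k)
print(z.shape)
```

Before normalization, each output entry equals the bilinear form x^T (U_i V_i^T) y for a rank-k factor pair, which is why the fused vector needs only o dimensions instead of the m × n entries of a full outer product.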

Related benchmarks

Task                                Dataset             Result                    Rank
Visual Question Answering           VQA v2 (test-std)   --                        466
Visual Question Answering           VQA (test-dev)      Acc (All) 62.2            147
Visual Dialog                       VisDial 1.0 (val)   MRR 0.6291                65
Visual Question Answer              VQA 1.0 (test-dev)  Overall Accuracy 65.1     44
Visual Question Answering           VQA-RAD (test)      Open-ended Accuracy 14.5  33
Medical Visual Question Answering   SLAKE (test)        Closed Accuracy 75        29
Relationship Phrase Detection       VRD                 Recall@50 82.46           20
Image-Text Compositional Retrieval  Shoes               Recall@10 36.59           14
Image-Text Compositional Retrieval  Birds-to-Words      Recall@10 30.43           14
Relationship Detection              VRD                 Recall@50 16.84           8
Showing 10 of 11 rows
