Multi-modal Factorized Bilinear Pooling with Co-Attention Learning for Visual Question Answering
About
Visual question answering (VQA) is challenging because it requires a simultaneous understanding of both the visual content of images and the textual content of questions. The approaches used to represent the images and questions in a fine-grained manner and to fuse these multi-modal features play key roles in performance. Bilinear pooling based models have been shown to outperform traditional linear models for VQA, but their high-dimensional representations and high computational complexity may seriously limit their applicability in practice. For multi-modal feature fusion, we develop a Multi-modal Factorized Bilinear (MFB) pooling approach that combines multi-modal features efficiently and effectively, and that outperforms other bilinear pooling approaches for VQA. For fine-grained image and question representation, we develop a co-attention mechanism using an end-to-end deep network architecture to jointly learn both the image and question attentions. Combining the proposed MFB approach with co-attention learning in a new network architecture provides a unified model for VQA. Our experimental results demonstrate that a single MFB with co-attention model achieves new state-of-the-art performance on the real-world VQA dataset. Code is available at https://github.com/yuzcccc/mfb.
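The abstract does not spell out the MFB operation, so the following is only a minimal NumPy sketch of factorized bilinear pooling as it is commonly described: each output unit is a bilinear form whose weight matrix is replaced by a rank-k product, computed as two linear projections, an element-wise product, and sum pooling over windows of size k, followed by power and L2 normalization. The function name `mfb_pool`, the argument names, and the toy dimensions are illustrative assumptions, not the authors' code from the linked repository.

```python
import numpy as np

def mfb_pool(x, y, U, V, k, normalize=True, eps=1e-12):
    """Factorized bilinear pooling sketch (hypothetical helper).

    x: image feature, shape (m,)
    y: question feature, shape (n,)
    U: projection for x, shape (m, k * o)
    V: projection for y, shape (n, k * o)
    k: rank of each factorized bilinear form
    Returns a fused feature of shape (o,).
    """
    # Project both modalities into a shared (k * o)-dimensional space
    # and combine them with an element-wise product.
    joint = (x @ U) * (y @ V)
    o = joint.shape[0] // k
    # Sum pooling over non-overlapping windows of size k.
    z = joint.reshape(o, k).sum(axis=1)
    if normalize:
        # Power normalization (signed square root), then L2 normalization.
        z = np.sign(z) * np.sqrt(np.abs(z))
        z = z / (np.linalg.norm(z) + eps)
    return z

# Toy dimensions, chosen only for illustration.
rng = np.random.default_rng(0)
m, n, k, o = 8, 6, 5, 4
x = rng.standard_normal(m)
y = rng.standard_normal(n)
U = rng.standard_normal((m, k * o))
V = rng.standard_normal((n, k * o))

z = mfb_pool(x, y, U, V, k)
z_raw = mfb_pool(x, y, U, V, k, normalize=False)
print(z.shape)
```

In this sketch the fusion stores (m + n) * k * o projection weights instead of the m * n * o weights a full bilinear map would need, which is the kind of saving the abstract refers to when it contrasts MFB with high-dimensional bilinear models.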
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Visual Question Answering | VQA v2 (test-std) | -- | 466 |
| Visual Question Answering | VQA (test-dev) | Acc (All): 62.2 | 147 |
| Visual Dialog | VisDial 1.0 (val) | MRR: 0.6291 | 65 |
| Visual Question Answer | VQA 1.0 (test-dev) | Overall Accuracy: 65.1 | 44 |
| Visual Question Answering | VQA-RAD (test) | Open-ended Accuracy: 14.5 | 33 |
| Medical Visual Question Answering | SLAKE (test) | Closed Accuracy: 75 | 29 |
| Relationship Phrase Detection | VRD | Recall@50: 82.46 | 20 |
| Image-Text Compositional Retrieval | Shoes | Recall@10: 36.59 | 14 |
| Image-Text Compositional Retrieval | Birds-to-Words | Recall@10: 30.43 | 14 |
| Relationship Detection | VRD | Recall@50: 16.84 | 8 |