
Bilinear Attention Networks

About

Attention networks in multimodal learning provide an efficient way to utilize given visual information selectively. However, the computational cost to learn attention distributions for every pair of multimodal input channels is prohibitively expensive. To solve this problem, co-attention builds two separate attention distributions for each modality neglecting the interaction between multimodal inputs. In this paper, we propose bilinear attention networks (BAN) that find bilinear attention distributions to utilize given vision-language information seamlessly. BAN considers bilinear interactions among two groups of input channels, while low-rank bilinear pooling extracts the joint representations for each pair of channels. Furthermore, we propose a variant of multimodal residual networks to exploit eight-attention maps of the BAN efficiently. We quantitatively and qualitatively evaluate our model on visual question answering (VQA 2.0) and Flickr30k Entities datasets, showing that BAN significantly outperforms previous methods and achieves new state-of-the-arts on both datasets.
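The abstract's two ingredients, a single attention distribution over every pair of input channels and a low-rank bilinear pooling of each pair, can be illustrated with a small sketch. This is not the authors' implementation: the function names, the row-per-channel layout, the projection matrices `U` and `V`, the weighting vector `p`, and the reuse of the same projections for both attention and pooling are assumptions made for illustration, and the nonlinearities and output projection a real model would use are left out.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax_all(m):
    """Softmax over every entry of a matrix, so the whole map sums to 1."""
    peak = max(max(row) for row in m)
    exps = [[math.exp(v - peak) for v in row] for row in m]
    total = sum(sum(row) for row in exps)
    return [[v / total for v in row] for row in exps]


def bilinear_attention(X, Y, U, V, p):
    """Return (attention map, joint feature) for two groups of channels.

    X: n_x rows of d_x-dim features (e.g. image regions)   -- assumed layout
    Y: n_y rows of d_y-dim features (e.g. question tokens) -- assumed layout
    U: d_x x K and V: d_y x K low-rank projections (hypothetical names)
    p: length-K vector weighting the K rank-one interactions
    """
    XU = matmul(X, U)  # n_x x K
    YV = matmul(Y, V)  # n_y x K
    K = len(p)
    # One logit per (x channel, y channel) pair: a low-rank bilinear form.
    logits = [[sum(p[k] * xi[k] * yj[k] for k in range(K)) for yj in YV] for xi in XU]
    A = softmax_all(logits)
    # Pool every pair's elementwise product, weighted by the attention map.
    joint = [
        sum(A[i][j] * XU[i][k] * YV[j][k]
            for i in range(len(XU)) for j in range(len(YV)))
        for k in range(K)
    ]
    return A, joint


def rand(rows, cols):
    """Random matrix with entries in [-1, 1], for the toy example only."""
    return [[random.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


random.seed(0)
X, Y = rand(3, 4), rand(2, 5)   # 3 image regions, 2 question tokens
U, V = rand(4, 6), rand(5, 6)   # rank K = 6
p = [random.uniform(-1, 1) for _ in range(6)]
A, joint = bilinear_attention(X, Y, U, V, p)
print(len(A), len(A[0]), len(joint))  # 3 2 6
```

The attention map has one entry per pair of channels (here 3 x 2), in contrast to co-attention, which would keep one distribution of length 3 and another of length 2. The low-rank factorization keeps the cost manageable because each pair is scored through K-dimensional projections rather than a full d_x x d_y weight matrix.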

Jin-Hwa Kim, Jaehyun Jun, Byoung-Tak Zhang • 2018

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Visual Question Answering | VQA v2 (test-dev) | Overall Accuracy: 70.04 | 706 |
| Visual Question Answering | VQA v2 (test-std) | -- | 486 |
| Visual Question Answering | VQA 2.0 (test-dev) | Accuracy: 70 | 337 |
| Visual Question Answering | OK-VQA (test) | Accuracy: 25.17 | 327 |
| Science Question Answering | ScienceQA (test) | Average Accuracy: 59.37 | 245 |
| Visual Question Answering | GQA (test-dev) | Accuracy: 55.2 | 184 |
| Visual Question Answering | VQA 2.0 (val) | Accuracy (Overall): 66.04 | 143 |
| Visual Question Answering | VQA-CP v2 (test) | Overall Accuracy: 39.31 | 128 |
| Visual Question Answering | VizWiz (test) | Accuracy: 51.4 | 79 |
| Visual Question Answering | OK-VQA v1.0 (test) | Accuracy: 25.17 | 77 |
Showing 10 of 44 rows
