
Heterogeneous Memory Enhanced Multimodal Attention Model for Video Question Answering

About

In this paper, we propose a novel end-to-end trainable Video Question Answering (VideoQA) framework with three major components: 1) a new heterogeneous memory which can effectively learn global context information from appearance and motion features; 2) a redesigned question memory which helps understand the complex semantics of the question and highlights queried subjects; and 3) a new multimodal fusion layer which performs multi-step reasoning by attending to relevant visual and textual hints with self-updated attention. Our VideoQA model first generates global context-aware visual and textual features, respectively, by interacting current inputs with memory contents. It then performs attentional fusion of the multimodal visual and textual representations to infer the correct answer. Multiple cycles of reasoning can be made to iteratively refine the attention weights over the multimodal data and improve the final representation of the QA pair. Experimental results demonstrate that our approach achieves state-of-the-art performance on four VideoQA benchmark datasets.
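The multi-step reasoning described above can be sketched as an iterative attention loop: a fused state attends over visual and textual features, and the attended summaries update the state for the next cycle. The following is a minimal NumPy sketch of that idea, not the paper's implementation; the function name, dimensions, and the random fusion weights are illustrative assumptions.

```python
import numpy as np

def softmax(x):
    # numerically stable softmax over a 1-D score vector
    e = np.exp(x - x.max())
    return e / e.sum()

def multistep_fusion(visual, textual, steps=3, seed=0):
    """Hypothetical sketch of multi-step attentional fusion:
    at each step, attend to visual and textual features with the
    current fused state, then update the state from the attended
    summaries (self-updated attention)."""
    rng = np.random.default_rng(seed)
    d = visual.shape[1]
    W = rng.standard_normal((2 * d, d)) * 0.1  # fusion weights (random, for the sketch only)
    state = np.zeros(d)                        # fused QA representation, refined each cycle
    for _ in range(steps):
        a_v = softmax(visual @ state)    # attention weights over video frames
        a_t = softmax(textual @ state)   # attention weights over question words
        v = a_v @ visual                 # attended visual summary
        t = a_t @ textual                # attended textual summary
        state = np.tanh(np.concatenate([v, t]) @ W)  # refine the fused state
    return state

# toy inputs: 8 frame features and 5 word features, both 16-dimensional
visual = np.random.default_rng(1).standard_normal((8, 16))
textual = np.random.default_rng(2).standard_normal((5, 16))
fused = multistep_fusion(visual, textual)
print(fused.shape)  # (16,)
```

A downstream answer classifier would then score candidate answers from the final fused state; running more `steps` lets the attention weights sharpen around the queried subjects.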

Chenyou Fan, Xiaofan Zhang, Shu Zhang, Wensheng Wang, Chi Zhang, Heng Huang · 2019

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Video Question Answering | MSRVTT-QA | Accuracy | 33.7 | 481 |
| Video Question Answering | MSRVTT-QA (test) | Accuracy | 33 | 371 |
| Video Question Answering | MSVD-QA | Accuracy | 33.7 | 340 |
| Video Question Answering | MSVD-QA (test) | -- | -- | 274 |
| Video Question Answering | NExT-QA (test) | Accuracy | 49.16 | 204 |
| Video Question Answering | NExT-QA (val) | Overall Acc | 48.72 | 176 |
| Video Question Answering | TGIF-QA | Accuracy | 53.8 | 147 |
| Audio-Visual Question Answering | MUSIC-AVQA 1.0 (test) | AV Localization Accuracy | 69.46 | 96 |
| Video Question Answering | TGIF-QA (test) | Accuracy | 77.8 | 89 |
| Text-to-Video Retrieval | MSRVTT 1k (test) | Recall@10 | 57.7 | 63 |
Showing 10 of 38 rows
