Attend What You Need: Motion-Appearance Synergistic Networks for Video Question Answering

About

Video Question Answering is a task that requires an AI agent to answer questions grounded in video. This task entails three key challenges: (1) understanding the intention of various questions, (2) capturing various elements of the input video (e.g., object, action, causality), and (3) cross-modal grounding between language and vision information. We propose Motion-Appearance Synergistic Networks (MASN), which embed two cross-modal features grounded on motion and appearance information and selectively utilize them depending on the question's intentions. MASN consists of a motion module, an appearance module, and a motion-appearance fusion module. The motion module computes the action-oriented cross-modal joint representations, while the appearance module focuses on the appearance aspect of the input video. Finally, the motion-appearance fusion module takes the outputs of the motion module and the appearance module as input and performs question-guided fusion. As a result, MASN achieves new state-of-the-art performance on the TGIF-QA and MSVD-QA datasets. We also conduct qualitative analysis by visualizing the inference results of MASN. The code is available at https://github.com/ahjeongseo/MASN-pytorch.

Ahjeong Seo, Gi-Cheon Kang, Joonhan Park, Byoung-Tak Zhang • 2021
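
The question-guided fusion described in the abstract can be illustrated with a small toy sketch. This is not the MASN implementation: the scaled dot-product scoring, the function name `question_guided_fusion`, and the toy vectors are all assumptions chosen only to show the general idea of weighting a motion feature and an appearance feature by their relevance to the question.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def question_guided_fusion(question, motion, appearance):
    """Hypothetical fusion: weight two feature vectors by question relevance.

    Assumption: relevance is a scaled dot product with the question vector.
    The actual MASN fusion module may differ.
    """
    scale = math.sqrt(len(question))
    scores = [dot(question, motion) / scale, dot(question, appearance) / scale]
    w_motion, w_appearance = softmax(scores)
    fused = [w_motion * m + w_appearance * a for m, a in zip(motion, appearance)]
    return fused, (w_motion, w_appearance)


# Toy vectors (made up): the question aligns with the motion feature.
question = [1.0, 0.0, 0.0, 0.0]
motion = [2.0, 0.0, 1.0, 0.0]
appearance = [0.0, 2.0, 0.0, 1.0]

fused, weights = question_guided_fusion(question, motion, appearance)
print("weights (motion, appearance):", weights)
print("fused feature:", fused)
```

In this toy setup the question vector overlaps only with the motion feature, so the motion weight comes out larger. A question about object color or count would, under the same assumption, shift the weight toward the appearance feature.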

Related benchmarks

Task	Dataset	Result
Video Question Answering	MSRVTT-QA	Accuracy 35.2	513
Video Question Answering	MSRVTT-QA (test)	Accuracy 35.2	376
Video Question Answering	TGIF-QA	Accuracy 84.4	161
Video Question Answering	TGIF-QA (test)	Accuracy 59.5	89
Transition Video Question Answering	TGIF-QA (test)	Accuracy 87.4	28
Video Question Answering	TGIF-QA Action original (test)	Accuracy 84.4	17
Video Question Answering	Sports-QA (test)	Overall Score 57	16
Transition Question Answering	TGIF-QA	Accuracy 87.4	14
Frame-QA	TGIF-QA	Accuracy 59.5	14

