Learning to Answer Questions in Dynamic Audio-Visual Scenarios

About

In this paper, we focus on the Audio-Visual Question Answering (AVQA) task, which aims to answer questions regarding different visual objects, sounds, and their associations in videos. The problem requires comprehensive multimodal understanding and spatio-temporal reasoning over audio-visual scenes. To benchmark this task and facilitate our study, we introduce a large-scale MUSIC-AVQA dataset, which contains more than 45K question-answer pairs covering 33 different question templates spanning different modalities and question types. We develop several baselines and introduce a spatio-temporal grounded audio-visual network for the AVQA problem. Our results demonstrate that AVQA benefits from multisensory perception and that our model outperforms recent A-, V-, and AVQA approaches. We believe that our dataset has the potential to serve as a testbed for evaluating and promoting progress in audio-visual scene understanding and spatio-temporal reasoning. Code and dataset: http://gewu-lab.github.io/MUSIC-AVQA/

Guangyao Li, Yake Wei, Yapeng Tian, Chenliang Xu, Ji-Rong Wen, Di Hu • 2022

Related benchmarks

Task                                    | Dataset                         | Metric              | Result | Rank
Audio-Visual Question Answering         | MUSIC-AVQA 1.0 (test)           | AV Localis Accuracy | 76.38  | 96
Audio-Visual Question Answering         | MUSIC-AVQA (test)               | Acc (Avg)           | 71.59  | 59
Audio Question Answering                | MUSIC-AVQA 1.0 (test)           | Counting Accuracy   | 78.18  | 43
Overall Audio-Visual Question Answering | MUSIC-AVQA (test)               | Overall Accuracy    | 71.52  | 21
Audio-Visual Question Answering         | MUSIC-AVQA                      | Accuracy            | 71.5   | 21
Audio-Video Question Answering          | MUSIC-AVQA                      | AV Temporal Acc     | 0.671  | 19
Audio-Visual Question Answering         | MUSIC-AVQA balanced v2.0 (test) | Total Accuracy      | 71.02  | 18
Audio-Visual Question Answering         | MUSIC-AVQA Bias v2.0 (test)     | Total Accuracy      | 73.07  | 18
Audio Question Answering                | MUSIC-AVQA (test)               | Accuracy (Avg)      | 74.06  | 17
Visual Question Answering               | MUSIC-AVQA v1.0 (test)          | Accuracy (Count)    | 0.7156 | 16
Showing 10 of 16 rows
