Learning to Answer Questions in Dynamic Audio-Visual Scenarios
About
In this paper, we focus on the Audio-Visual Question Answering (AVQA) task, which aims to answer questions regarding different visual objects, sounds, and their associations in videos. The problem requires comprehensive multimodal understanding and spatio-temporal reasoning over audio-visual scenes. To benchmark this task and facilitate our study, we introduce the large-scale MUSIC-AVQA dataset, which contains more than 45K question-answer pairs covering 33 different question templates spanning different modalities and question types. We develop several baselines and introduce a spatio-temporal grounded audio-visual network for the AVQA problem. Our results demonstrate that AVQA benefits from multisensory perception and that our model outperforms recent A-, V-, and AVQA approaches. We believe that our dataset has the potential to serve as a testbed for evaluating and promoting progress in audio-visual scene understanding and spatio-temporal reasoning. Code and dataset: http://gewu-lab.github.io/MUSIC-AVQA/
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Audio-Visual Question Answering | MUSIC-AVQA 1.0 (test) | AV Localis Accuracy | 76.38 | 96 |
| Audio-Visual Question Answering | MUSIC-AVQA (test) | Acc (Avg) | 71.59 | 59 |
| Audio Question Answering | MUSIC-AVQA 1.0 (test) | Counting Accuracy | 78.18 | 43 |
| Overall Audio-Visual Question Answering | MUSIC-AVQA (test) | Overall Accuracy | 71.52 | 21 |
| Audio-Visual Question Answering | MUSIC-AVQA | Accuracy | 71.5 | 21 |
| Audio-Video Question Answering | MUSIC-AVQA | AV Temporal Acc | 0.671 | 19 |
| Audio-Visual Question Answering | MUSIC-AVQA balanced v2.0 (test) | Total Accuracy | 71.02 | 18 |
| Audio-Visual Question Answering | MUSIC-AVQA Bias v2.0 (test) | Total Accuracy | 73.07 | 18 |
| Audio Question Answering | MUSIC-AVQA (test) | Accuracy (Avg) | 74.06 | 17 |
| Visual Question Answering | MUSIC-AVQA v1.0 (test) | Accuracy (Count) | 0.7156 | 16 |
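Several of the metrics above (e.g. "Acc (Avg)") aggregate accuracy over per-question-type scores rather than over raw samples. The sketch below illustrates that aggregation on hypothetical data; the tuple format, function name, and unweighted averaging are assumptions for illustration, not the official MUSIC-AVQA evaluation code.

```python
from collections import defaultdict

def per_type_accuracy(samples):
    """Per-question-type accuracy and their unweighted mean.

    `samples` is a hypothetical list of (question_type, prediction, answer)
    tuples; the real benchmark's evaluation script may differ in detail.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for qtype, pred, ans in samples:
        total[qtype] += 1
        if pred == ans:
            correct[qtype] += 1
    per_type = {t: correct[t] / total[t] for t in total}
    # An "Acc (Avg)"-style score: unweighted mean over question types,
    # so small question types count as much as large ones.
    avg = sum(per_type.values()) / len(per_type)
    return per_type, avg

per_type, avg = per_type_accuracy([
    ("audio_counting", "three", "three"),
    ("audio_counting", "two", "three"),
    ("av_temporal", "left", "left"),
])
# per_type == {"audio_counting": 0.5, "av_temporal": 1.0}, avg == 0.75
```

Note that an unweighted per-type mean can differ from overall sample-level accuracy whenever question types are unevenly represented, which is why the table lists both "Overall Accuracy" and averaged variants.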