
Boosting Audio Visual Question Answering via Key Semantic-Aware Cues

About

The Audio Visual Question Answering (AVQA) task aims to answer questions related to various visual objects, sounds, and their interactions in videos. Such naturally multimodal videos contain rich and complex dynamic audio-visual components, with only a portion of them closely related to the given questions. Hence, effectively perceiving audio-visual cues relevant to the given questions is crucial for correctly answering them. In this paper, we propose a Temporal-Spatial Perception Model (TSPM), which aims to empower the model to perceive key visual and auditory cues related to the questions. Specifically, considering the challenge of aligning non-declarative questions and visual representations into the same semantic space using visual-language pretrained models, we construct declarative sentence prompts derived from the question template, to assist the temporal perception module in better identifying critical segments relevant to the questions. Subsequently, a spatial perception module is designed to merge visual tokens from selected segments to highlight key latent targets, followed by cross-modal interaction with audio to perceive potential sound-aware areas. Finally, the significant temporal-spatial cues from these modules are integrated to answer the question. Extensive experiments on multiple AVQA benchmarks demonstrate that our framework excels not only in understanding audio-visual scenes but also in answering complex questions effectively. Code is available at https://github.com/GeWu-Lab/TSPM.
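As a rough illustration of the temporal perception idea described above, the sketch below scores each video segment against a declarative prompt embedding and keeps the top-k most similar segments. It is only a minimal sketch under stated assumptions, not the authors' implementation: the function names (`cosine`, `select_key_segments`), the use of cosine similarity, the top-k selection rule, and the toy two-dimensional embeddings are all hypothetical. In practice the embeddings would come from a visual-language pretrained model, and the actual TSPM module (see the linked repository) may differ.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def select_key_segments(prompt_emb, segment_embs, k):
    """Return the indices of the k segments most similar to the prompt.

    Assumption: similarity to a declarative prompt embedding is used as the
    relevance score. Indices are returned in temporal order.
    """
    scores = [cosine(prompt_emb, seg) for seg in segment_embs]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


# Toy data (hypothetical): one prompt embedding and four segment embeddings.
prompt = [1.0, 0.0]
segments = [[0.0, 1.0], [1.0, 0.1], [0.9, 0.0], [-1.0, 0.0]]
selected = select_key_segments(prompt, segments, k=2)
print(selected)
```

In this toy setup the second and third segments point in nearly the same direction as the prompt, so they are the ones kept for any later spatial processing.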

Guangyao Li, Henghui Du, Di Hu • 2024

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Audio-Visual Question Answering | MUSIC-AVQA 1.0 (test) | AV Localis Accuracy | 71.85 | 96 |
| Audio-Visual Question Answering | MUSIC-AVQA (test) | Acc (Avg) | 76.79 | 59 |
| Audio Question Answering | MUSIC-AVQA 1.0 (test) | Counting Accuracy | 84.07 | 43 |
| Visual Question Answering | MUSIC-AVQA v1.0 (test) | Accuracy (Count) | 0.8229 | 16 |
| Audio-Visual Question Answering | MUSIC-AVQA-R (test) | Audio QA Count (Head) | 81.65 | 13 |
| Audio-Visual Question Answering | AVQA (test) | Total Accuracy | 90.8 | 13 |
