
Question-Aware Gaussian Experts for Audio-Visual Question Answering

About

Audio-Visual Question Answering (AVQA) requires not only question-based multimodal reasoning but also precise temporal grounding to capture subtle dynamics for accurate prediction. However, existing methods mainly use question information implicitly, limiting focus on question-specific details. Furthermore, most studies rely on uniform frame sampling, which can miss key question-relevant frames. Although recent Top-K frame selection methods aim to address this, their discrete nature still overlooks fine-grained temporal details. This paper proposes QA-TIGER, a novel framework that explicitly incorporates question information and models continuous temporal dynamics. Our key idea is to use Gaussian-based modeling to adaptively focus on both consecutive and non-consecutive frames based on the question, while explicitly injecting question information and applying progressive refinement. We leverage a Mixture of Experts (MoE) to flexibly implement multiple Gaussian models, activating temporal experts specifically tailored to the question. Extensive experiments on multiple AVQA benchmarks show that QA-TIGER consistently achieves state-of-the-art performance. Code is available at https://aim-skku.github.io/QA-TIGER/
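The core idea in the abstract, question-conditioned Gaussian weighting over time combined through a mixture of experts, can be illustrated with a small NumPy sketch. This is not the QA-TIGER implementation: the function names, the linear projections `w_center`, `w_width`, and `w_gate`, the sigmoid/softplus parameterization, and the minimum width of 0.05 are all assumptions made for illustration. Each hypothetical expert places one Gaussian on the normalized time axis, and a softmax gate computed from the question mixes the experts into a single set of frame weights.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def gaussian_expert_weights(num_frames, centers, widths):
    """Return an (E, T) array; row e is a normalized Gaussian over frame positions."""
    t = np.linspace(0.0, 1.0, num_frames)  # normalized frame positions in [0, 1]
    z = (t[None, :] - centers[:, None]) / widths[:, None]
    g = np.exp(-0.5 * z ** 2)
    return g / g.sum(axis=1, keepdims=True)


def question_aware_pooling(frame_feats, question_feat, w_center, w_width, w_gate):
    """Pool frames with a question-gated mixture of Gaussians (illustrative only).

    frame_feats:   (T, D) per-frame features
    question_feat: (D,)   question embedding
    w_center, w_width, w_gate: (D, E) hypothetical projection matrices
    """
    # Assumption: the question alone predicts each expert's center and width.
    centers = 1.0 / (1.0 + np.exp(-(question_feat @ w_center)))  # sigmoid -> (0, 1)
    widths = 0.05 + np.log1p(np.exp(question_feat @ w_width))    # softplus plus a floor
    gates = softmax(question_feat @ w_gate)                      # (E,), sums to 1

    per_expert = gaussian_expert_weights(frame_feats.shape[0], centers, widths)
    weights = gates @ per_expert     # (T,) mixture of Gaussians over time
    pooled = weights @ frame_feats   # (D,) weighted average of frame features
    return weights, pooled
```

Because the mixture can place Gaussians at separate centers, the resulting weights can emphasize non-consecutive segments while still varying smoothly across neighboring frames, which is the contrast with discrete Top-K selection that the abstract draws.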

Hongyeob Kim, Inyoung Jung, Dayoon Suh, Youjia Zhang, Sangmin Lee, Sungeun Hong • 2025

Related benchmarks

Task                            | Dataset                         | Metric                | Result | Rank
--------------------------------|---------------------------------|-----------------------|--------|-----
Audio-Visual Question Answering | MUSIC-AVQA 1.0 (test)           | AV Localis Accuracy   | 72.5   | 96
Audio-Visual Question Answering | MUSIC-AVQA (test)               | Acc (Avg)             | 73.74  | 59
Audio Question Answering        | MUSIC-AVQA 1.0 (test)           | Counting Accuracy     | 84.86  | 43
Audio-Visual Question Answering | MUSIC-AVQA Bias v2.0 (test)     | Total Accuracy        | 77.08  | 18
Audio-Visual Question Answering | MUSIC-AVQA balanced v2.0 (test) | Total Accuracy        | 70.22  | 18
Visual Question Answering       | MUSIC-AVQA v1.0 (test)          | Accuracy (Count)      | 0.8396 | 16
Audio-Visual Question Answering | MUSIC-AVQA-R (test)             | Audio QA Count (Head) | 82.67  | 13
