CAT: Enhancing Multimodal Large Language Model to Answer Questions in Dynamic Audio-Visual Scenarios

About

This paper focuses on the challenge of answering questions in scenarios composed of rich and complex dynamic audio-visual components. Although existing Multimodal Large Language Models (MLLMs) can respond to audio-visual content, their responses are sometimes ambiguous and fail to describe specific audio-visual events. To overcome this limitation, we introduce CAT, which enhances MLLMs in three ways: 1) Besides straightforwardly bridging audio and video, we design a clue aggregator that aggregates question-related clues in dynamic audio-visual scenarios to enrich the detailed knowledge required by large language models. 2) CAT is trained on a mixed multimodal dataset, allowing direct application in audio-visual scenarios. Notably, we collect an audio-visual joint instruction dataset named AVinstruct to further enhance the capacity of CAT to model cross-semantic correlations. 3) We propose AI-assisted ambiguity-aware direct preference optimization, a strategy specialized in retraining the model to favor non-ambiguous responses and to improve its ability to localize specific audio-visual objects. Extensive experimental results demonstrate that CAT outperforms existing methods on multimodal tasks, especially Audio-Visual Question Answering (AVQA) tasks. The code and the collected instructions are released at https://github.com/rikeilong/Bay-CAT.
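For readers unfamiliar with direct preference optimization, the following is a minimal sketch of the standard DPO loss for a single preference pair. It is not the ambiguity-aware variant proposed in the paper, whose details are not given in the abstract. The function name `dpo_loss`, its parameters, and the default `beta=0.1` are assumptions chosen for illustration. Treating the non-ambiguous response as "chosen" and the ambiguous one as "rejected" follows the abstract's stated goal of favoring non-ambiguous responses.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) response pair.

    Inputs are summed token log-probabilities of each response under the
    policy being trained and under a frozen reference model. All names and
    the beta default are illustrative assumptions, not taken from the paper.
    """
    # How much more the policy likes each response than the reference does.
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_ratio - rejected_ratio)
    # loss = -log(sigmoid(margin)), computed in a numerically stable way.
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Hypothetical log-probabilities: the policy prefers the non-ambiguous
# (chosen) response more strongly than the reference model does.
loss = dpo_loss(-1.0, -3.0, -2.0, -2.0)
```

The loss equals log 2 when the policy and reference agree, falls below that as the policy shifts probability toward the chosen response, and rises above it when the policy favors the rejected one.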

Qilang Ye, Zitong Yu, Rui Shao, Xinyu Xie, Philip Torr, Xiaochun Cao • 2024

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Video Question Answering | MSRVTT-QA | Accuracy: 62.1 | 481 |
| Video Question Answering | ActivityNet-QA | Accuracy: 50.2 | 319 |
| Video-based generative performance | Video-ChatGPT benchmark | Correctness Score: 61.6 | 76 |
| Video Question Answering | VCG Bench | CI: 3.08 | 42 |
| Audio-Visual Question Answering | MUSIC-AVQA | Accuracy: 48.6 | 21 |
| Audio-Visual Question Answering | AVQA | Accuracy: 92 | 14 |
| Audio-Visual Question Answering | Music-AVQA 30 (test) | Overall Accuracy: 84.3 | 7 |
| Audio-Visual Question Answering | AVSD (test) | CIDEr: 79 | 6 |
| Audio-Visual Question Answering | AVSD 1 (test) | CIDEr: 79 | 6 |
| Audio-Visual Question Answering | AVQA 69 (test) | Accuracy: 92 | 5 |
Showing 10 of 11 rows
