Audio-Visual LLM for Video Understanding
About
This paper presents Audio-Visual LLM, a Multimodal Large Language Model that takes both visual and auditory inputs for holistic video understanding. A key design element is modality-augmented training, which integrates modality-specific tokens that selectively activate the appropriate visual and/or auditory encoder. This mechanism enables end-to-end joint training on video data across different modality configurations, including visual-only, audio-only, and audio-visual formats. Moreover, we introduce a high-quality video instruction dataset generated with GPT-4. This dataset allows Audio-Visual LLM to adeptly process a variety of task-oriented video instructions, ranging from multi-turn conversations and audio-visual narratives to complex reasoning tasks. Extensive experiments demonstrate that Audio-Visual LLM achieves strong zero-shot results across a range of video understanding tasks. For example, Audio-Visual LLM achieves an accuracy of 53.7% on MSRVTT-QA, outperforming the non-LLM-based InternVideo by 6.6% and the LLM-based Valley by 4.4%. Additionally, our Audio-Visual LLM achieves competitive performance on audio tasks (e.g., AudioCaps).
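To make the modality-augmented training idea concrete, below is a minimal sketch of how modality-specific tokens can route inputs through the relevant encoders. All names here (`AudioVisualFusion`, `MODALITY_TOKENS`, the encoder stand-ins, and the feature dimensions) are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal sketch of modality-token routing, under assumed names and shapes.
import torch
import torch.nn as nn

MODALITY_TOKENS = {"visual": 0, "audio": 1, "audio_visual": 2}

class AudioVisualFusion(nn.Module):
    def __init__(self, d_model: int = 768):
        super().__init__()
        # Stand-ins for the visual/audio encoders (e.g., a ViT and an audio
        # transformer); modeled here as projections over pre-extracted features.
        self.visual_encoder = nn.Linear(1024, d_model)
        self.audio_encoder = nn.Linear(128, d_model)
        # Learned embeddings for the modality-specific tokens that signal
        # which encoder outputs are present in the sequence.
        self.modality_embed = nn.Embedding(len(MODALITY_TOKENS), d_model)

    def forward(self, modality: str,
                visual_feats: torch.Tensor | None = None,
                audio_feats: torch.Tensor | None = None) -> torch.Tensor:
        """Build the multimodal prefix fed to the LLM.

        Only the encoders matching `modality` are activated, so one model
        can be jointly trained end-to-end on visual-only, audio-only, and
        audio-visual clips.
        """
        tok = self.modality_embed(
            torch.tensor([MODALITY_TOKENS[modality]]))        # (1, d_model)
        parts = [tok]
        if modality in ("visual", "audio_visual"):
            parts.append(self.visual_encoder(visual_feats))   # (T_v, d_model)
        if modality in ("audio", "audio_visual"):
            parts.append(self.audio_encoder(audio_feats))     # (T_a, d_model)
        return torch.cat(parts, dim=0)  # prefix tokens for the LLM

# Example: an audio-visual clip with 8 visual tokens and 4 audio tokens.
fusion = AudioVisualFusion()
prefix = fusion("audio_visual",
                visual_feats=torch.randn(8, 1024),
                audio_feats=torch.randn(4, 128))
print(prefix.shape)  # torch.Size([13, 768])
```

The point of the routing is that absent modalities contribute nothing to the prefix, so mixed-modality batches can share one training loop without dummy inputs.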
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Video Question Answering | ActivityNet (test) | Accuracy | 47.2 | 57 |
| Audio-Visual Question Answering | MUSIC-AVQA | Accuracy | 45.2 | 21 |
| Audio-Visual Question Answering | AVQA (test) | Total Accuracy | 78.7 | 13 |
| Open-Ended Audio-Video QA | MUSIC-QA | Accuracy | 45.2 | 11 |
| Audio-Video Understanding | MU-AVQA (test) | Accuracy | 45.2 | 9 |
| Audio-Video Understanding | AVSD (test) | Accuracy | 52.6 | 9 |
| Open-Ended Audio-Video QA | AVSD | Accuracy | 52.6 | 7 |
| Audio-Visual Question Answering | AVSD | Accuracy | 52.6 | 6 |
| Open-Ended Audio-Video QA | VGGSound | Accuracy | 0.476 | 6 |