Empowering LLMs with Pseudo-Untrimmed Videos for Audio-Visual Temporal Understanding

About

Large language models (LLMs) have demonstrated remarkable capabilities in natural language and multimodal domains. Fine-tuning multimodal LLMs on temporal annotations from well-annotated datasets, e.g., dense video captioning datasets, gives them temporal understanding capacity in video-language tasks. However, there is a notable lack of untrimmed audio-visual video datasets with precise temporal annotations for events. This deficiency hinders LLMs from learning the alignment between time, audio-visual events, and text tokens, thus impairing their ability to temporally localize audio-visual events in videos. To address this gap, we introduce PU-VALOR, a comprehensive audio-visual dataset comprising over 114,000 pseudo-untrimmed videos with detailed temporal annotations. PU-VALOR is derived from the large-scale but coarsely annotated audio-visual dataset VALOR through a subtle method involving event-based video clustering, random temporal scaling, and permutation. By fine-tuning a multimodal LLM on PU-VALOR, we developed AVicuna, a model capable of aligning audio-visual events with temporal intervals and corresponding text tokens. AVicuna excels in temporal localization and time-aware dialogue. Our experiments demonstrate that AVicuna effectively handles temporal understanding in audio-visual videos and achieves state-of-the-art performance on open-ended video QA, audio-visual QA, and audio-visual event dense localization tasks.
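
The abstract names random temporal scaling and permutation as the steps that turn trimmed clips into a pseudo-untrimmed video with temporal annotations. The following is a minimal sketch of that idea, not the authors' implementation. It assumes the clips were already selected by the event-based clustering step, which is not shown. The names `Clip`, `Segment`, and `build_pseudo_untrimmed` and the `scale_range` default are hypothetical choices made for illustration.

```python
import random
from dataclasses import dataclass


@dataclass
class Clip:
    caption: str     # coarse caption of one trimmed clip
    duration: float  # original clip length in seconds


@dataclass
class Segment:
    caption: str
    start: float     # start time inside the composed video, in seconds
    end: float       # end time inside the composed video, in seconds


def build_pseudo_untrimmed(clips, rng, scale_range=(0.5, 2.0)):
    """Permute the clips, rescale each duration, and lay them end to end.

    scale_range is an assumed range, not a value taken from the paper.
    """
    order = list(clips)
    rng.shuffle(order)  # random permutation of the clip order
    segments = []
    cursor = 0.0
    for clip in order:
        scale = rng.uniform(*scale_range)  # random temporal scaling factor
        length = clip.duration * scale
        # The clip boundaries become the temporal annotation of its caption.
        segments.append(Segment(clip.caption, cursor, cursor + length))
        cursor += length
    return segments


# Usage with made-up captions and durations.
clips = [
    Clip("a dog barks", 10.0),
    Clip("a man plays guitar", 10.0),
    Clip("rain falls on a roof", 10.0),
]
segments = build_pseudo_untrimmed(clips, random.Random(0))
for seg in segments:
    print(f"{seg.start:6.2f}s to {seg.end:6.2f}s: {seg.caption}")
```

Because each caption inherits the boundaries of its clip, the composed video carries interval-level annotations even though the source clips only had clip-level captions.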

Yolo Yunlong Tang, Daiki Shimada, Jing Bi, Mingqian Feng, Hang Hua, Chenliang Xu • 2024

Related benchmarks

Task	Dataset	Result
Audio-Visual Question Answering	MUSIC-AVQA (test)	--	94
Audio-Visual Question Answering	MUSIC-AVQA	Accuracy 49.6	38
Multimodal Future Prediction	FutureOmni 1.0 (Overall)	Accuracy (Cartoon) 31.62	20
Audio-to-Video temporal grounding	ChronusAV	BLEU-4 0.11	17
Text-to-Audio temporal grounding	ChronusAV	BLEU-4 0.01	17
Video-to-Audio temporal grounding	ChronusAV	BLEU-4 0.02	17
Open-Ended Audio-Video QA	MUSIC-QA	Accuracy 49.6	11
Video-to-Text (V2T) Temporally Grounded Generation	ChronusAV	R@0.5 10.8	9
Text-to-Video (T2V) Temporally Grounded Generation	ChronusAV	BLEU-4 0.00e+0	9
Video-to-Text temporal grounding	ChronusAV	Recall@IoU=0.5 10.75	8

Showing 10 of 14 rows
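
Several rows above report recall at a temporal IoU threshold of 0.5. As a rough illustration of how such a metric is commonly computed, the sketch below scores one predicted interval per query against one ground-truth interval. The exact evaluation protocol of the listed benchmarks is not stated here, so the one-to-one pairing, the percentage scale, and the function names `temporal_iou` and `recall_at_iou` are assumptions.

```python
def temporal_iou(pred, gt):
    """Intersection over union of two (start, end) intervals in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def recall_at_iou(preds, gts, threshold=0.5):
    """Percentage of queries whose prediction reaches the IoU threshold.

    Assumes preds[i] is the single prediction for ground truth gts[i].
    """
    if not gts:
        return 0.0
    hits = sum(temporal_iou(p, g) >= threshold for p, g in zip(preds, gts))
    return 100.0 * hits / len(gts)


# Made-up intervals for illustration only.
preds = [(0.0, 10.0), (5.0, 15.0), (20.0, 30.0)]
gts = [(0.0, 8.0), (10.0, 20.0), (21.0, 29.0)]
print(recall_at_iou(preds, gts, threshold=0.5))
```

In this toy example the first and third predictions have an IoU of 0.8 with their ground truth and the second has an IoU of one third, so two of the three queries count as hits.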
