
video-SALMONN: Speech-Enhanced Audio-Visual Large Language Models

About

Speech understanding as an element of the more generic video understanding using audio-visual large language models (av-LLMs) is a crucial yet understudied aspect. This paper proposes video-SALMONN, a single end-to-end av-LLM for video processing, which can understand not only visual frame sequences, audio events and music, but speech as well. To obtain the fine-grained temporal information required by speech understanding while remaining efficient for other video elements, this paper proposes a novel multi-resolution causal Q-Former (MRC Q-Former) structure to connect pre-trained audio-visual encoders and the backbone large language model. Moreover, dedicated training approaches, including the diversity loss and the unpaired audio-visual mixed training scheme, are proposed to avoid frame or modality dominance. On the introduced speech-audio-visual evaluation benchmark, video-SALMONN achieves more than 25% absolute accuracy improvements on the video-QA task and over 30% absolute accuracy improvements on audio-visual QA tasks with human speech. In addition, video-SALMONN demonstrates remarkable video comprehension and reasoning abilities on tasks that are unprecedented by other av-LLMs. Our training code and model checkpoints are available at https://github.com/bytedance/SALMONN/.

Guangzhi Sun, Wenyi Yu, Changli Tang, Xianzhao Chen, Tian Tan, Wei Li, Lu Lu, Zejun Ma, Yuxuan Wang, Chao Zhang • 2024

Related benchmarks

Task                               Dataset              Result                Rank
Audiovisual Video Captioning       SALMONN 2 (test)     Miss Rate: 52.1       37
Audio Question Answering           MMAR                 Average Score: 42.5   35
Audio Question Answering           MMAU                 Score: 58.36          18
Multimodal Cloze                   Omni-Cloze Audio     Accuracy: 10.6        18
Multimodal Cloze                   Omni-Cloze           Visual Score: 3.5     16
Audio-Visual Question Answering    Daily-Omni           Score: 45             8
Audio-Visual Question Answering    Video-MME            Score: 41.8           8
Audio-Visual Question Answering    Video-Holmes         Score: 31.4           8
