video-SALMONN: Speech-Enhanced Audio-Visual Large Language Models
About
Speech understanding, as an element of the broader video-understanding problem addressed by audio-visual large language models (av-LLMs), is crucial yet understudied. This paper proposes video-SALMONN, a single end-to-end av-LLM for video processing that can understand not only visual frame sequences, audio events, and music, but also speech. To obtain the fine-grained temporal information required for speech understanding while remaining efficient for other video elements, this paper proposes a novel multi-resolution causal Q-Former (MRC Q-Former) structure to connect pre-trained audio-visual encoders and the backbone large language model. Moreover, dedicated training approaches, including a diversity loss and an unpaired audio-visual mixed training scheme, are proposed to avoid frame or modality dominance. On the introduced speech-audio-visual evaluation benchmark, video-SALMONN achieves more than 25% absolute accuracy improvement on the video-QA task and over 30% absolute accuracy improvement on audio-visual QA tasks involving human speech. In addition, video-SALMONN demonstrates remarkable video comprehension and reasoning abilities on tasks that no other av-LLM has demonstrated. Our training code and model checkpoints are available at https://github.com/bytedance/SALMONN/.
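The core idea of the MRC Q-Former is that the same encoder output sequence is summarized at several temporal window sizes, so that speech gets fine-grained tokens while coarser resolutions stay cheap for other video elements. The following is a minimal, hypothetical sketch of that idea (not the paper's actual implementation): learnable queries cross-attend to non-overlapping causal windows of the feature sequence at each resolution, and the resulting tokens from all resolutions are concatenated before being passed to the LLM. All class and parameter names here are illustrative assumptions.

```python
import torch
import torch.nn as nn

class MRCQFormerSketch(nn.Module):
    """Illustrative sketch of a multi-resolution causal Q-Former.

    The input feature sequence is split into non-overlapping windows at
    several temporal resolutions; a shared set of learnable queries
    cross-attends to each window, and the pooled tokens from every
    resolution are concatenated into one sequence for the backbone LLM.
    This is a simplified assumption, not the released model code.
    """

    def __init__(self, dim=256, n_queries=4, resolutions=(1, 4, 16)):
        super().__init__()
        self.resolutions = resolutions
        # Shared learnable query tokens, reused for every window.
        self.queries = nn.Parameter(torch.randn(n_queries, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)

    def forward(self, x):  # x: (batch, time, dim) encoder features
        outs = []
        for win in self.resolutions:
            # Causal, non-overlapping windows: each query group only
            # sees features from its own window, not future frames.
            for start in range(0, x.size(1), win):
                chunk = x[:, start:start + win]                # (B, <=win, D)
                q = self.queries.unsqueeze(0).expand(x.size(0), -1, -1)
                pooled, _ = self.attn(q, chunk, chunk)         # cross-attention
                outs.append(pooled)
        # Fine resolutions contribute many tokens (good for speech),
        # coarse resolutions few (efficient for visual/audio events).
        return torch.cat(outs, dim=1)
```

With 16 input frames and resolutions (1, 4, 16), the sketch produces 16 + 4 + 1 = 21 windows of 4 query tokens each, i.e. 84 output tokens, illustrating how token budget scales with resolution.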
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Audiovisual Video Captioning | SALMONN 2 (test) | Miss Rate: 52.1 | 37 |
| Audio Question Answering | MMAR | Average Score: 42.5 | 35 |
| Audio Question Answering | MMAU | Score: 58.36 | 18 |
| Multimodal Cloze | Omni-Cloze Audio | Accuracy: 10.6 | 18 |
| Multimodal Cloze | Omni-Cloze | Visual Score: 3.5 | 16 |
| Audio-Visual Question Answering | Daily-Omni | Score: 45 | 8 |
| Audio-Visual Question Answering | Video-MME | Score: 41.8 | 8 |
| Audio-Visual Question Answering | Video-Holmes | Score: 31.4 | 8 |