
video-SALMONN 2: Caption-Enhanced Audio-Visual Large Language Models

About

We present video-SALMONN 2, a family of audio-visual large language models that set new state-of-the-art (SOTA) results in video description and question answering (QA). Our core contribution is multi-round direct preference optimisation (MrDPO), paired with a caption-quality objective that jointly rewards completeness and factual accuracy. Unlike standard DPO with a fixed reference policy, MrDPO periodically refreshes the reference by bootstrapping from a newly re-initialised lightweight adapter trained on the latest preferences, avoiding reference staleness and enabling continual improvement. This strategy produces captions that are consistently more detailed and accurate than those from proprietary systems such as GPT-4o and Gemini-1.5 Pro. We further distil these gains by using our model to generate a high-quality video-caption corpus for supervised fine-tuning of new models, transferring benefits beyond captioning to strong performance on complex video-QA tasks. Across widely used audio-visual and visual-only understanding benchmarks (including Video-MME, WorldSense, AVUT, Video-Holmes, DailyOmni, MLVU, and LVBench), our 3B and 7B models achieve SOTA results at comparable scales, while the 72B model surpasses all other open-source systems. Our source code, models, and data are released at https://github.com/bytedance/video-SALMONN-2.
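The MrDPO idea described above — run DPO in rounds and refresh the reference policy between rounds rather than keeping the round-0 reference fixed — can be sketched in a toy form. Everything below is an illustrative assumption: the "policy" is a plain dict of per-response log-probabilities, and the reference refresh is approximated by copying the current policy, whereas the paper bootstraps it from a newly re-initialised lightweight adapter.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dpo_loss(policy_lp, ref_lp, chosen, rejected, beta=0.1):
    """Standard DPO loss for one preference pair, from per-response log-probs."""
    margin = (policy_lp[chosen] - ref_lp[chosen]) - (
        policy_lp[rejected] - ref_lp[rejected]
    )
    return -math.log(sigmoid(beta * margin))

def mrdpo(policy_lp, preference_rounds, lr=0.5, beta=0.1):
    """Toy multi-round DPO: the reference is refreshed after every round
    (the MrDPO idea), instead of staying fixed at the initial policy."""
    policy_lp = dict(policy_lp)
    ref_lp = dict(policy_lp)  # round-0 reference = initial policy
    for pairs in preference_rounds:
        for chosen, rejected in pairs:
            margin = (policy_lp[chosen] - ref_lp[chosen]) - (
                policy_lp[rejected] - ref_lp[rejected]
            )
            # Gradient step on the DPO loss: push the chosen response up
            # and the rejected one down, scaled by how unsure the model is.
            step = lr * beta * sigmoid(-beta * margin)
            policy_lp[chosen] += step
            policy_lp[rejected] -= step
        ref_lp = dict(policy_lp)  # reference refresh: avoids staleness
    return policy_lp

# Three rounds of the same preference pair on a toy two-response policy.
policy = mrdpo({"good": 0.0, "bad": 0.0},
               preference_rounds=[[("good", "bad")]] * 3)
```

With a fixed reference, the margin term saturates as the policy drifts away from it; copying the reference each round resets the margin so every round contributes a fresh update — a minimal analogue of the continual improvement the abstract describes.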

Changli Tang, Yixuan Li, Yudong Yang, Jimin Zhuang, Guangzhi Sun, Wei Li, Zejun Ma, Chao Zhang • 2025

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Video Understanding | VideoMME | -- | -- | 192 |
| Long Video Understanding | LVBench | Accuracy | 0.497 | 63 |
| Audio-Visual Understanding | WorldSense | Accuracy | 56.5 | 32 |
| Long Video Understanding | MLVU (dev) | -- | -- | 31 |
| Audio-Visual Understanding | Daily-Omni | Accuracy | 79.4 | 27 |
| Audiovisual Video Captioning | SALMONN 2 (test) | Miss Rate | 21.2 | 26 |
| Audiovisual Video Captioning | UGC-VideoCap | Audio Score | 61.8 | 26 |
| Multimodal Future Prediction | FutureOmni 1.0 (Overall) | Accuracy (Cartoon) | 43.59 | 20 |
| Long Audio-Video Question Answering | WorldSense | Average Accuracy | 56.5 | 18 |
| Audiovisual Dialogue Description | DiaDemBench | REF | 11.5 | 15 |

Showing 10 of 17 rows.
