
video-SALMONN 2: Caption-Enhanced Audio-Visual Large Language Models

About

We present video-SALMONN 2, a family of audio-visual large language models that set new state-of-the-art (SOTA) results in video description and question answering (QA). Our core contribution is multi-round direct preference optimisation (MrDPO), paired with a caption-quality objective that jointly rewards completeness and factual accuracy. Unlike standard DPO with a fixed reference policy, MrDPO periodically refreshes the reference by bootstrapping from a newly re-initialised lightweight adapter trained on the latest preferences, avoiding reference staleness and enabling continual improvement. This strategy produces captions that are consistently more detailed and accurate than those from proprietary systems such as GPT-4o and Gemini-1.5 Pro. We further distil these gains by using our model to generate a high-quality video-caption corpus for supervised fine-tuning of new models, transferring benefits beyond captioning to strong performance on complex video-QA tasks. Across widely used audio-visual and visual-only understanding benchmarks (including Video-MME, WorldSense, AVUT, Video-Holmes, DailyOmni, MLVU, and LVBench), our 3B and 7B models achieve SOTA results at comparable scales, while the 72B model surpasses all other open-source systems. Our source code, models, and data are released at https://github.com/bytedance/video-SALMONN-2.
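The MrDPO idea described above — run DPO in rounds and refresh the reference policy between rounds rather than keeping the round-0 reference fixed — can be sketched in a toy form. Everything below is an illustrative assumption: the "policy" is a plain dict of per-response log-probabilities, and the reference refresh is approximated by copying the current policy, whereas the paper bootstraps it from a newly re-initialised lightweight adapter.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dpo_loss(policy_lp, ref_lp, chosen, rejected, beta=0.1):
    """Standard DPO loss for one preference pair, from per-response log-probs."""
    margin = (policy_lp[chosen] - ref_lp[chosen]) - (
        policy_lp[rejected] - ref_lp[rejected]
    )
    return -math.log(sigmoid(beta * margin))

def mrdpo(policy_lp, preference_rounds, lr=0.5, beta=0.1):
    """Toy multi-round DPO: the reference is refreshed after every round
    (the MrDPO idea), instead of staying fixed at the initial policy."""
    policy_lp = dict(policy_lp)
    ref_lp = dict(policy_lp)  # round-0 reference = initial policy
    for pairs in preference_rounds:
        for chosen, rejected in pairs:
            margin = (policy_lp[chosen] - ref_lp[chosen]) - (
                policy_lp[rejected] - ref_lp[rejected]
            )
            # Gradient step on the DPO loss: push the chosen response up
            # and the rejected one down, scaled by how unsure the model is.
            step = lr * beta * sigmoid(-beta * margin)
            policy_lp[chosen] += step
            policy_lp[rejected] -= step
        ref_lp = dict(policy_lp)  # reference refresh: avoids staleness
    return policy_lp

# Three rounds of the same preference pair on a toy two-response policy.
policy = mrdpo({"good": 0.0, "bad": 0.0},
               preference_rounds=[[("good", "bad")]] * 3)
```

With a fixed reference, the margin term saturates as the policy drifts away from it; copying the reference each round resets the margin so every round contributes a fresh update — a minimal analogue of the continual improvement the abstract describes.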

Changli Tang, Yixuan Li, Yudong Yang, Jimin Zhuang, Guangzhi Sun, Wei Li, Zejun Ma, Chao Zhang • 2025

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Video Understanding | VideoMME | -- | -- | 192 |
| Long Video Understanding | LVBench | Accuracy | 0.497 | 63 |
| Audio-Visual Understanding | WorldSense | Accuracy | 56.5 | 32 |
| Long Video Understanding | MLVU (dev) | -- | -- | 31 |
| Audio-Visual Understanding | Daily-Omni | Accuracy | 79.4 | 27 |
| Audiovisual Video Captioning | SALMONN 2 (test) | Miss Rate | 21.2 | 26 |
| Audiovisual Video Captioning | UGC-VideoCap | Audio Score | 61.8 | 26 |
| Multimodal Future Prediction | FutureOmni 1.0 (Overall) | Accuracy (Cartoon) | 43.59 | 20 |
| Long Audio-Video Question Answering | WorldSense | Average Accuracy | 56.5 | 18 |
| Audiovisual Dialogue Description | DiaDemBench | REF | 11.5 | 15 |

Showing 10 of 17 rows.
