OmniVideo-R1: Reinforcing Audio-visual Reasoning with Query Intention and Modality Attention

About

While humans perceive the world through diverse modalities that operate synergistically to support a holistic understanding of their surroundings, existing omnivideo models still face substantial challenges on audio-visual understanding tasks. In this paper, we propose OmniVideo-R1, a novel reinforced framework that improves mixed-modality reasoning. OmniVideo-R1 empowers models to "think with omnimodal cues" through two key strategies: (1) query-intensive grounding based on self-supervised learning paradigms; and (2) modality-attentive fusion built upon contrastive learning paradigms. Extensive experiments on multiple benchmarks demonstrate that OmniVideo-R1 consistently outperforms strong baselines, highlighting its effectiveness and robust generalization capabilities.

Zhangquan Chen, Jiale Tao, Ruihuang Li, Yihao Hu, Ruitao Chen, Zhantao Yang, Xinlei Yu, Haodong Jing, Manyuan Zhang, Shuai Shao, Biao Wang, Qinglin Lu, Ruqi Huang • 2026

Related benchmarks

Task                          Dataset         Result            Rank
Video Understanding           VideoMME        --                192
Long Video Understanding      LVBench         Accuracy 0.519    63
Audio-visual understanding    WorldSense      Accuracy 65.8     32
Long Video Understanding      MLVU (dev)      --                31
Audio-visual understanding    Daily-Omni      Accuracy 82.8     27
Audio-visual understanding    IntentBench     Accuracy 74.2     11
Audio-Visual Joint Reasoning  OmniVideoBench  Music Score 40.7  11
Audio-visual understanding    VideoHolmes     Accuracy 62.9     10
