OmniVideo-R1: Reinforcing Audio-visual Reasoning with Query Intention and Modality Attention
About
While humans perceive the world through diverse modalities that operate synergistically to support a holistic understanding of their surroundings, existing omni-video models still face substantial challenges on audio-visual understanding tasks. In this paper, we propose OmniVideo-R1, a novel reinforced framework that improves mixed-modality reasoning. OmniVideo-R1 empowers models to "think with omnimodal cues" through two key strategies: (1) query-intention grounding based on self-supervised learning paradigms; and (2) modality-attentive fusion built upon contrastive learning paradigms. Extensive experiments on multiple benchmarks demonstrate that OmniVideo-R1 consistently outperforms strong baselines, highlighting its effectiveness and robust generalization capabilities.
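To make the modality-attentive fusion idea concrete, the sketch below shows one minimal way such a mechanism could work: per-modality features (e.g. audio and visual) are weighted by attention scores conditioned on the query embedding, so the query's intention determines how much each modality contributes to the fused representation. This is a hypothetical illustration, not the paper's implementation; the function name, the scaled-dot-product scoring, and the softmax weighting are all assumptions.

```python
import numpy as np

def modality_attentive_fusion(query, modality_feats):
    """Fuse per-modality feature vectors via query-conditioned attention.

    Hypothetical sketch: scores each modality against the query with a
    scaled dot product, normalizes with softmax, and returns the weighted
    sum of modality features plus the attention weights.
    """
    d = query.shape[-1]
    # One relevance score per modality, conditioned on the query.
    scores = np.array([feat @ query / np.sqrt(d) for feat in modality_feats])
    # Numerically stable softmax over modalities.
    exp_scores = np.exp(scores - scores.max())
    weights = exp_scores / exp_scores.sum()
    # Fused representation: attention-weighted sum of modality features.
    fused = sum(w * f for w, f in zip(weights, modality_feats))
    return fused, weights

# Example: a query attending over toy audio and visual embeddings.
query = np.ones(4)
audio_feat = np.arange(4.0)        # stand-in audio embedding
visual_feat = np.full(4, 2.0)      # stand-in visual embedding
fused, weights = modality_attentive_fusion(query, [audio_feat, visual_feat])
```

In a trained model, the scoring function would typically be learned (e.g. projection matrices before the dot product), and the contrastive objective mentioned above would shape the embedding space so that the query aligns with the informative modality.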
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Video Understanding | VideoMME | -- | 192 |
| Long Video Understanding | LVBench | Accuracy: 0.519 | 63 |
| Audio-Visual Understanding | WorldSense | Accuracy: 65.8 | 32 |
| Long Video Understanding | MLVU (dev) | -- | 31 |
| Audio-Visual Understanding | Daily-Omni | Accuracy: 82.8 | 27 |
| Audio-Visual Understanding | IntentBench | Accuracy: 74.2 | 11 |
| Audio-Visual Joint Reasoning | OmniVideoBench | Music Score: 40.7 | 11 |
| Audio-Visual Understanding | VideoHolmes | Accuracy: 62.9 | 10 |