
Omni-R1: Reinforcement Learning for Omnimodal Reasoning via Two-System Collaboration

About

Long-horizon video-audio reasoning and fine-grained pixel understanding impose conflicting requirements on omnimodal models: dense temporal coverage demands many low-resolution frames, whereas precise grounding calls for high-resolution inputs. We tackle this trade-off with a two-system architecture: a Global Reasoning System selects informative keyframes and rewrites the task at low spatial cost, while a Detail Understanding System performs pixel-level grounding on the selected high-resolution snippets. Because "optimal" keyframe selection and reformulation are ambiguous and hard to supervise, we formulate them as a reinforcement learning (RL) problem and present Omni-R1, an end-to-end RL framework built on Group Relative Policy Optimization. Omni-R1 trains the Global Reasoning System through hierarchical rewards obtained via online collaboration with the Detail Understanding System, requiring only one epoch of RL on small task splits. Experiments on two challenging benchmarks, namely Referring Audio-Visual Segmentation (RefAVS) and Reasoning Video Object Segmentation (REVOS), show that Omni-R1 not only surpasses strong supervised baselines but also outperforms specialized state-of-the-art models, while substantially improving out-of-domain generalization and mitigating multimodal hallucination. Our results demonstrate the first successful application of RL to large-scale omnimodal reasoning and highlight a scalable path toward universal foundation models.
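To make the Group Relative Policy Optimization step mentioned in the abstract more concrete, here is a minimal sketch of the group-relative advantage it is known for: several responses are sampled for the same prompt, and each response's reward is normalized against the mean and standard deviation of its own group. The function name, the `eps` constant, and the reward values are illustrative assumptions; the abstract does not specify how Omni-R1's hierarchical rewards are computed, so the numbers below merely stand in for them.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against its own sampling group (GRPO-style).

    `rewards` holds one scalar reward per sampled response to the same prompt.
    `eps` is an assumed small constant that avoids division by zero when all
    rewards in the group are identical.
    """
    if not rewards:
        return []
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four sampled keyframe selections of one video.
rewards = [0.2, 0.8, 0.5, 0.5]
advantages = group_relative_advantages(rewards)
```

Responses that score above the group mean receive a positive advantage and those below receive a negative one, which is why this scheme needs no separately learned value function.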

Hao Zhong, Muzhi Zhu, Zongze Du, Zheng Huang, Canyu Zhao, Mingyu Liu, Wen Wang, Hao Chen, Chunhua Shen • 2025

Related benchmarks

Task                               Dataset                Result                    Rank
Video Understanding                MVBench                --                        425
Video Understanding                Video-MME              Overall Score 60.7        92
Audio-visual understanding         DailyOmni              Average Score 46.8        69
Video Understanding                LVBench                Average Score 37.6        67
Multimodal Mathematical Reasoning  MathVista mini (test)  Overall Accuracy 64.7     48
Multi-modal Reasoning              MathVision (test)      Accuracy (%) 25.4         45
Audio-visual understanding         WorldSense             Accuracy 44.1             42
Video Reasoning                    Video-MME              Overall Performance 63.2  39
Audio Reasoning                    MMAR (test)            Sound Score 67.3          38
Video Reasoning                    Video-Holmes           Accuracy 40.72            37
Showing 10 of 34 rows
