Omni-o3: Deep Nested Omnimodal Deduction for Deliberative Audio-Visual Reasoning

About

Omnimodal understanding entails a massive, highly redundant search space of cross-modal interactions, demanding focused and deliberative reasoning. Current reasoning paradigms rely on either sequential step-by-step generation or parallel sample-by-sample rollouts, leading to isolated reasoning trajectories. This inability to share promising intermediate paths severely limits exploration efficiency and causes compounding errors in complex audio-visual tasks. To break this bottleneck, we introduce Omni-o3, a novel framework driven by a deep nested deduction policy. By formulating reasoning as a dynamic recursive search, Omni-o3 inherently shares reasoning prefixes across branches, enabling the iterative execution of four atomic cognitive actions: expansion, selection, simulation, and backpropagation. To empower this framework, we propose a robust two-stage training paradigm: (1) cold-start supervised fine-tuning on 101K high-quality, long-chain trajectories distilled from 3.5M diverse omnimodal samples, enabling necessary recursive search patterns; and (2) nested group rollout-driven exploratory reinforcement learning on 18K complex multi-turn samples, explicitly guided by a novel multi-step reward model to stimulate deep nested reasoning. Extensive experiments demonstrate that Omni-o3 achieves competitive performance across 11 benchmarks, unlocking advanced capabilities in comprehensive audio-visual, visual-centric, and audio-centric reasoning tasks.
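The four actions named above (expansion, selection, simulation, backpropagation) are also the classic phases of Monte Carlo tree search. The following is a minimal toy sketch of that generic loop, included only to illustrate how branches of a search tree can share a reasoning prefix. It is not the Omni-o3 policy: the names `Node`, `search`, and `reward_fn` are hypothetical, the random rollout stands in for a learned model, and the toy reward stands in for the multi-step reward model described in the abstract.

```python
import math
import random
from dataclasses import dataclass, field


@dataclass
class Node:
    prefix: tuple                      # steps shared by every descendant of this node
    parent: "Node | None" = None
    children: list = field(default_factory=list)
    visits: int = 0
    value: float = 0.0                 # sum of rewards backed up through this node


def select(node, c=1.4):
    """Selection: descend by UCB1 until a node without children is reached."""
    while node.children:
        parent_visits = node.visits
        node = max(
            node.children,
            key=lambda ch: float("inf") if ch.visits == 0
            else ch.value / ch.visits
            + c * math.sqrt(math.log(parent_visits) / ch.visits),
        )
    return node


def expand(node, actions, max_depth):
    """Expansion: one child per candidate step, each extending the parent's prefix."""
    if len(node.prefix) < max_depth:
        node.children = [Node(node.prefix + (a,), parent=node) for a in actions]
    return node.children


def simulate(prefix, actions, max_depth, reward_fn, rng):
    """Simulation: finish the prefix with random steps and score the full chain."""
    steps = list(prefix)
    while len(steps) < max_depth:
        steps.append(rng.choice(actions))
    return reward_fn(tuple(steps))


def backpropagate(node, reward):
    """Backpropagation: credit the reward to every node on the shared prefix."""
    while node is not None:
        node.visits += 1
        node.value += reward
        node = node.parent


def search(actions, max_depth, reward_fn, iterations=200, seed=0):
    rng = random.Random(seed)
    root = Node(prefix=())
    for _ in range(iterations):
        leaf = select(root)
        children = expand(leaf, actions, max_depth)
        if children:                   # non-terminal leaf: roll out from a new child
            leaf = rng.choice(children)
        reward = simulate(leaf.prefix, actions, max_depth, reward_fn, rng)
        backpropagate(leaf, reward)
    node = root                        # read out the most-visited path
    while node.children:
        node = max(node.children, key=lambda ch: ch.visits)
    return node.prefix
```

As a usage illustration, calling `search(["see", "hear", "link"], 3, reward_fn)` with a `reward_fn` that returns the fraction of positions matching some target chain yields a three-step tuple drawn from the given actions. Because every child stores its parent's prefix plus one step, a reward found on one branch updates the statistics of all ancestors, which is the sense in which sibling branches share intermediate paths rather than exploring in isolation.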

Zhicheng Zhang, Wentao Gu, Weicheng Wang, Yongjie Zhu, Wenyu Qin, Meng Wang, Pengfei Wan, Jufeng Yang • 2026

Related benchmarks

Task                        Dataset         Metric             Result   Rank
Video Question Answering    VideoMMMU       Accuracy           59.9     140
Spatio-Temporal Reasoning   V-Star          --                 --       44
Audio Question Answering    MMAU            Score              75.6     28
Video Question Answering    VideoEspresso   Accuracy           50.9     24
Audio Question Answering    MMSU            Score              70.4     23
General & Complex QA        VMME            Overall Accuracy   72.6     21
Temporal Grounding          Charades-STA    mIoU               23.3     21
General & Complex QA        WSense          Accuracy (All)     53       19
General & Complex QA        VHolmes         Overall Score      53       19
General & Complex QA        IntentB         Accuracy (All)     68.4     19
Showing 10 of 23 rows
