Omni-o3: Deep Nested Omnimodal Deduction for Deliberative Audio-Visual Reasoning

About

Omnimodal understanding entails a massive, highly redundant search space of cross-modal interactions, demanding focused and deliberative reasoning. Current reasoning paradigms rely on either sequential step-by-step generation or parallel sample-by-sample rollouts, leading to isolated reasoning trajectories. This inability to share promising intermediate paths severely limits exploration efficiency and causes compounding errors in complex audio-visual tasks. To break this bottleneck, we introduce Omni-o3, a novel framework driven by a deep nested deduction policy. By formulating reasoning as a dynamic recursive search, Omni-o3 inherently shares reasoning prefixes across branches, enabling the iterative execution of four atomic cognitive actions: expansion, selection, simulation, and backpropagation. To empower this framework, we propose a robust two-stage training paradigm: (1) cold-start supervised fine-tuning on 101K high-quality, long-chain trajectories distilled from 3.5M diverse omnimodal samples, enabling necessary recursive search patterns; and (2) nested group rollout-driven exploratory reinforcement learning on 18K complex multi-turn samples, explicitly guided by a novel multi-step reward model to stimulate deep nested reasoning. Extensive experiments demonstrate that Omni-o3 achieves competitive performance across 11 benchmarks, unlocking advanced capabilities in comprehensive audio-visual, visual-centric, and audio-centric reasoning tasks.
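
The four atomic actions named above (expansion, selection, simulation, backpropagation) are the same steps used in a generic Monte Carlo tree search loop. As a rough illustration of how such a loop shares reasoning prefixes across branches, here is a minimal sketch. It is not the authors' nested deduction policy: the `Node` class, the UCB1 selection rule, the random rollout, and the toy `reward_fn` are all assumptions made for this example, and the toy reward merely stands in for the multi-step reward model mentioned in the abstract.

```python
import math
import random


class Node:
    """One reasoning prefix; each child extends it by a single step."""

    def __init__(self, prefix, parent=None):
        self.prefix = prefix      # tuple of steps taken so far
        self.parent = parent
        self.children = {}        # step -> Node (branches share this prefix)
        self.visits = 0
        self.value = 0.0          # sum of rewards backed up through this node


def select(node, actions, depth, c=1.4):
    """Selection: descend through fully expanded nodes using UCB1."""
    while len(node.prefix) < depth and len(node.children) == len(actions):
        parent = node
        node = max(
            parent.children.values(),
            key=lambda ch: ch.value / ch.visits
            + c * math.sqrt(math.log(parent.visits) / ch.visits),
        )
    return node


def expand(node, actions, depth):
    """Expansion: add one untried next step unless the prefix is complete."""
    if len(node.prefix) >= depth:
        return node
    untried = [a for a in actions if a not in node.children]
    step = random.choice(untried)
    child = Node(node.prefix + (step,), parent=node)
    node.children[step] = child
    return child


def simulate(node, actions, depth, reward_fn):
    """Simulation: complete the prefix with random steps and score it."""
    path = list(node.prefix)
    while len(path) < depth:
        path.append(random.choice(actions))
    return reward_fn(tuple(path))


def backpropagate(node, reward):
    """Backpropagation: credit the reward to every prefix on the path."""
    while node is not None:
        node.visits += 1
        node.value += reward
        node = node.parent


def search(actions, depth, reward_fn, iterations=500, seed=0):
    """Run the four-action loop and return the most-visited path and the root."""
    random.seed(seed)
    root = Node(prefix=())
    for _ in range(iterations):
        leaf = select(root, actions, depth)
        child = expand(leaf, actions, depth)
        reward = simulate(child, actions, depth, reward_fn)
        backpropagate(child, reward)
    node, best = root, []
    while node.children:
        step, node = max(node.children.items(), key=lambda kv: kv[1].visits)
        best.append(step)
    return tuple(best), root


if __name__ == "__main__":
    # Hypothetical toy task: recover a hidden three-step chain.
    target = ("see", "hear", "fuse")
    score = lambda path: sum(p == t for p, t in zip(path, target)) / len(target)
    best, root = search(["see", "hear", "fuse"], depth=3, reward_fn=score)
    print(best, root.visits)
```

Because every child node stores its parent's prefix plus one step, statistics gathered in one branch are reused by all branches that pass through the same prefix, which is the sharing property the abstract contrasts with isolated sequential or parallel rollouts.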

Zhicheng Zhang, Wentao Gu, Weicheng Wang, Yongjie Zhu, Wenyu Qin, Meng Wang, Pengfei Wan, Jufeng Yang • 2026

Related benchmarks

Task	Dataset	Result
Video Question Answering	VideoMMMU	Accuracy 59.9	166
Spatio-Temporal Reasoning	V-Star	--	44
Audio Question Answering	MMSU	Score 70.4	31
Long Video Question Answering	LVBench	All Score 48.1	31
Audio Question Answering	MMAU	Score 75.6	28
Video Question Answering	VideoEspresso	Accuracy 50.9	24
General & Complex QA	VMME	Overall Accuracy 72.6	21
Temporal Grounding	Charades-STA	mIoU 23.3	21
Voice Evaluation	VoiceBench	SD Score (VoiceBench) 46.4	20
General & Complex QA	WSense	Accuracy (All) 53	19

Showing 10 of 23 rows
