Context Unrolling in Omni Models

About

We present Omni, a unified multimodal model natively trained on diverse modalities, including text, images, videos, 3D geometry, and hidden representations. We find that such training enables Context Unrolling, where the model explicitly reasons across multiple modal representations before producing predictions. This process enables the model to aggregate complementary information across heterogeneous modalities, facilitating a more faithful approximation of the shared multimodal knowledge manifold and improving downstream reasoning fidelity. As a result, Omni achieves strong performance on both multimodal generation and understanding benchmarks, while demonstrating advanced multimodal reasoning capabilities, including in-context generation of text, image, video, and 3D geometry.

Ceyuan Yang, Zhijie Lin, Yang Zhao, Fei Xiao, Hao He, Qi Zhao, Chaorui Deng, Kunchang Li, Zihan Ding, Yuwei Guo, Fuyun Wang, Fangqi Zhu, Xiaonan Nie, Shenhan Zhu, Shanchuan Lin, Hongsheng Li, Weilin Huang, Guang Shi, Haoqi Fan • 2026

Related benchmarks

Task	Dataset	Result
Video Understanding	MVBench	Accuracy 68.4	635
Multimodal Understanding	MMStar	Accuracy 63.8	511
Diagram Understanding	AI2D	Accuracy 91.5	377
Visual Question Answering	SimpleVQA	Accuracy 0.533	225
Monocular Depth Estimation	ETH3D	AbsRel 3.12	173
Monocular Depth Estimation	DIODE	AbsRel 20.34	161
Chart Understanding	ChartQA	Accuracy 86.9	159
Video Understanding	Video-MME without subtitles	--	145
Monocular Depth Estimation	Sintel	AbsRel 0.334	142
Camera pose estimation	CO3D v2	AUC@30 75.21	132

Showing 10 of 30 rows
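
For readers unfamiliar with the AbsRel metric listed for the depth benchmarks above, the following is a minimal sketch of the standard definition (mean of |predicted - ground truth| / ground truth over valid pixels). The function name `abs_rel`, the `eps` threshold, and the toy depth values are illustrative assumptions and are not taken from the paper; the paper's own evaluation protocol (masking, alignment, scaling) may differ. The table values also appear to mix scales (for example 3.12 versus 0.334), which may mean some entries are percentages; that reading is a guess, not something the page states.

```python
def abs_rel(pred, gt, eps=1e-8):
    """Mean absolute relative depth error over pixels with valid ground truth.

    pred, gt: equal-length sequences of depth values.
    Pixels whose ground-truth depth is <= eps are treated as invalid and skipped.
    """
    pairs = [(p, g) for p, g in zip(pred, gt) if g > eps]
    if not pairs:
        raise ValueError("no valid ground-truth depths")
    return sum(abs(p - g) / g for p, g in pairs) / len(pairs)


# Toy example (hypothetical values): the last pixel has no valid ground truth.
pred = [1.1, 2.0, 2.7, 5.0]
gt = [1.0, 2.0, 3.0, 0.0]
score = abs_rel(pred, gt)
print(f"AbsRel as a fraction: {score:.4f}")
print(f"AbsRel as a percentage: {100 * score:.2f}")
```

Lower is better for AbsRel, unlike the accuracy and AUC rows in the same table.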
