OmniVinci: Enhancing Architecture and Data for Omni-Modal Understanding LLM

About

Advancing machine intelligence requires developing the ability to perceive across multiple modalities, much as humans sense the world. We introduce OmniVinci, an initiative to build a strong, open-source, omni-modal LLM. We carefully study the design choices across model architecture and data curation. For model architecture, we present three key innovations: (i) OmniAlignNet for strengthening alignment between vision and audio embeddings in a shared omni-modal latent space; (ii) Temporal Embedding Grouping for capturing relative temporal alignment between vision and audio signals; and (iii) Constrained Rotary Time Embedding for encoding absolute temporal information in omni-modal embeddings. We introduce a curation and synthesis pipeline that generates 24M single-modal and omni-modal conversations. We find that modalities reinforce one another in both perception and reasoning. Our model, OmniVinci, outperforms Qwen2.5-Omni with +19.05 on DailyOmni (cross-modal understanding), +1.7 on MMAR (audio), and +3.9 on Video-MME (vision), while using just 0.2T training tokens - a 6 times reduction compared to Qwen2.5-Omni's 1.2T. We finally demonstrate omni-modal advantages in downstream applications spanning robotics, medical AI, and smart factory.

Hanrong Ye, Chao-Han Huck Yang, Arushi Goel, Wei Huang, Ligeng Zhu, Yuanhang Su, Sean Lin, An-Chieh Cheng, Zhen Wan, Jinchuan Tian, Yuming Lou, Dong Yang, Zhijian Liu, Yukang Chen, Ambrish Dantrey, Ehsan Jahangiri, Sreyan Ghosh, Daguang Xu, Ehsan Hosseini-Asl, Danial Mohseni Taheri, Vidya Murali, Sifei Liu, Yao Lu, Oluwatobi Olabiyi, Yu-Chiang Frank Wang, Rafael Valle, Bryan Catanzaro, Andrew Tao, Song Han, Jan Kautz, Hongxu Yin, Pavlo Molchanov• 2025

Related benchmarks

Task	Dataset	Result
Object Hallucination Evaluation	POPE	Accuracy89.5	2019
Multimodal Understanding	MMMU	Accuracy49.7	437
Multimodal Perception	MME Perception	Perception Score1.65e+3	99
Audio-visual understanding	DailyOmni	Average Score66.5	83
Audio-visual understanding	WorldSense	Accuracy48.2	72
Video-driven Audio Hallucination	AVHBench	Accuracy58.56	27
Cross-modal hallucination evaluation	AVHBench	Overall Accuracy61.36	22
Omnimodal Video-Audio Question Answering	IMAVB 1.0	Std Dev Score (Video)75.4	18
Audio-visual understanding	Video-MME	Score68.6	15
Audiovisual Understanding & Reasoning	Daily-Omni	Score66.5	15

Showing 10 of 21 rows

Other info

Follow for update

@wizwand_team Discord