OmniVinci: Enhancing Architecture and Data for Omni-Modal Understanding LLM
About
Advancing machine intelligence requires developing the ability to perceive across multiple modalities, much as humans sense the world. We introduce OmniVinci, an initiative to build a strong, open-source, omni-modal LLM. We carefully study the design choices across model architecture and data curation. For model architecture, we present three key innovations: (i) OmniAlignNet for strengthening alignment between vision and audio embeddings in a shared omni-modal latent space; (ii) Temporal Embedding Grouping for capturing relative temporal alignment between vision and audio signals; and (iii) Constrained Rotary Time Embedding for encoding absolute temporal information in omni-modal embeddings. For data, we introduce a curation and synthesis pipeline that generates 24M single-modal and omni-modal conversations. We find that modalities reinforce one another in both perception and reasoning. Our model, OmniVinci, outperforms Qwen2.5-Omni by +19.05 on DailyOmni (cross-modal understanding), +1.7 on MMAR (audio), and +3.9 on Video-MME (vision), while using just 0.2T training tokens, a 6x reduction compared to Qwen2.5-Omni's 1.2T. Finally, we demonstrate omni-modal advantages in downstream applications spanning robotics, medical AI, and smart factories.
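The abstract does not say how Temporal Embedding Grouping is implemented. To make the idea of relative temporal alignment concrete, the sketch below shows one generic way to interleave timestamped vision and audio tokens by fixed time window, so that tokens from the same stretch of time sit next to each other in the sequence. The function name `group_by_time`, the `window` parameter, and the vision-before-audio ordering are all assumptions made for illustration, not details taken from OmniVinci.

```python
from collections import defaultdict


def group_by_time(vision, audio, window):
    """Interleave timestamped vision and audio tokens by fixed time window.

    vision, audio: lists of (timestamp_seconds, token) pairs.
    window: window length in seconds (hypothetical parameter).
    Returns a flat list of tokens ordered window by window; within each
    window, vision tokens come first, then audio tokens (an arbitrary
    choice made for this sketch).
    """
    buckets = defaultdict(lambda: {"vision": [], "audio": []})
    for t, tok in vision:
        buckets[int(t // window)]["vision"].append((t, tok))
    for t, tok in audio:
        buckets[int(t // window)]["audio"].append((t, tok))

    ordered = []
    for idx in sorted(buckets):
        for modality in ("vision", "audio"):
            # Sort by timestamp only, so tokens never need to be comparable.
            items = sorted(buckets[idx][modality], key=lambda pair: pair[0])
            ordered.extend(tok for _, tok in items)
    return ordered


# Toy example: four video frames and two audio chunks, 2-second windows.
vision = [(0.0, "v0"), (1.0, "v1"), (2.0, "v2"), (3.0, "v3")]
audio = [(0.5, "a0"), (2.5, "a1")]
sequence = group_by_time(vision, audio, window=2.0)
```

In this toy layout the audio chunk at 0.5 s lands directly after the two frames from the first window, rather than after all vision tokens, which is the kind of relative ordering cue the abstract alludes to.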
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Object Hallucination Evaluation | POPE | Accuracy | 89.5 | 1455 |
| Multimodal Understanding | MMMU | Accuracy | 49.7 | 437 |
| Multimodal Perception | MME Perception | Perception Score | 1.65e+3 | 79 |
| Audio-visual understanding | DailyOmni | Average Score | 66.5 | 69 |
| Audio-visual understanding | WorldSense | Accuracy | 48.2 | 42 |
| Video-driven Audio Hallucination | AVHBench | Accuracy | 58.56 | 27 |
| Cross-modal hallucination evaluation | AVHBench | Overall Accuracy | 61.36 | 22 |
| Audio-visual understanding | Video-MME | Score | 68.6 | 15 |
| Audiovisual Understanding & Reasoning | Daily-Omni | Score | 66.5 | 15 |
| Audiovisual Dialogue Description | DiaDemBench | REF | 17.1 | 15 |