Dynin-Omni: Omnimodal Unified Large Diffusion Language Model
About
We present Dynin-Omni, the first masked-diffusion-based omnimodal foundation model that unifies text, image, and speech understanding and generation, together with video understanding, within a single architecture. Unlike autoregressive unified models that serialize heterogeneous modalities, or compositional unified models that require orchestration with external modality-specific decoders, Dynin-Omni natively formulates omnimodal modeling as masked diffusion over a shared discrete token space, enabling iterative refinement under bidirectional context. Dynin-Omni adopts a multi-stage training strategy with model-merging-based modality expansion and omnimodal alignment. We evaluate Dynin-Omni across 19 multimodal benchmarks spanning language reasoning, image generation and editing, video understanding, and speech recognition and synthesis. Dynin-Omni achieves 87.6 on GSM8K, 1733.6 on MME-P, 61.4 on VideoMME, 0.87 on GenEval, and 2.1 WER on LibriSpeech test-clean, consistently outperforming existing open-source unified models while remaining competitive with strong modality-specific expert systems. These results demonstrate the potential of masked diffusion as a unified paradigm for any-to-any modeling, providing a flexible foundation for real-time omnimodal systems, unified cross-modal retrieval and generation, and embodied multimodal agents.
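To make "iterative refinement under bidirectional context" concrete, here is a minimal sketch of a generic masked-diffusion decoding loop over a discrete token sequence. It is not Dynin-Omni's actual sampler: the `toy_denoiser` stand-in, the `MASK` id, the confidence-based unmasking rule, and the even unmasking schedule are all assumptions chosen for illustration. A real model would predict tokens from the full, partially masked sequence rather than at random.

```python
import random

MASK = -1  # hypothetical id for the mask token; real vocabularies reserve their own


def toy_denoiser(tokens, vocab_size, rng):
    """Stand-in for a bidirectional model (assumption, not the paper's network).

    Returns one (predicted_token, confidence) pair per position. A real model
    would condition on every visible token on both sides of each position.
    """
    return [(rng.randrange(vocab_size), rng.random()) for _ in tokens]


def masked_diffusion_decode(length, vocab_size, steps, seed=0):
    """Start fully masked and commit the most confident predictions each step."""
    rng = random.Random(seed)
    tokens = [MASK] * length
    for step in range(steps):
        masked = [i for i, t in enumerate(tokens) if t == MASK]
        if not masked:
            break
        preds = toy_denoiser(tokens, vocab_size, rng)
        # Spread the remaining masked positions evenly over the remaining steps
        # (ceiling division), so the final step always unmasks everything left.
        remaining_steps = steps - step
        k = -(-len(masked) // remaining_steps)
        # Commit the k masked positions with the highest confidence.
        masked.sort(key=lambda i: preds[i][1], reverse=True)
        for i in masked[:k]:
            tokens[i] = preds[i][0]
    return tokens


if __name__ == "__main__":
    print(masked_diffusion_decode(length=16, vocab_size=100, steps=4))
```

In a shared discrete token space, the same loop could in principle fill text, image, or speech tokens, since every position is just an index into one vocabulary. How Dynin-Omni schedules and conditions this across modalities is not specified in the abstract.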
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Object Hallucination Evaluation | POPE | Accuracy | 87.7 | 1455 |
| Multimodal Understanding | MMBench | Accuracy | 74.7 | 637 |
| Multimodal Understanding | MMMU | Accuracy | 51.6 | 437 |
| Video Understanding | MVBench | Accuracy | 62 | 425 |
| Video Question Answering | ActivityNet-QA | Accuracy | 56.3 | 376 |
| Science Question Answering | ARC Challenge | Accuracy | 68.6 | 342 |
| Multimodal Perception | MME Perception | Perception Score | 1733.6 | 79 |
| Video Question Answering | NextQA | Accuracy | 81.9 | 78 |
| Temporal Video Understanding | TempCompass | -- | -- | 68 |
| Text-to-Speech | LibriSpeech clean (test) | WER | 2.1 | 66 |