Dynin-Omni: Omnimodal Unified Large Diffusion Language Model

About

We present Dynin-Omni, the first masked-diffusion-based omnimodal foundation model that unifies text, image, and speech understanding and generation, together with video understanding, within a single architecture. Unlike autoregressive unified models that serialize heterogeneous modalities, or compositional unified models that require orchestration with external modality-specific decoders, Dynin-Omni natively formulates omnimodal modeling as masked diffusion over a shared discrete token space, enabling iterative refinement under bidirectional context. Dynin-Omni adopts a multi-stage training strategy with model-merging-based modality expansion and omnimodal alignment. We evaluate Dynin-Omni across 19 multimodal benchmarks spanning language reasoning, image generation and editing, video understanding, and speech recognition and synthesis. Dynin-Omni achieves 87.6 on GSM8K, 1733.6 on MME-P, 61.4 on VideoMME, 0.87 on GenEval, and 2.1 WER on LibriSpeech test-clean, consistently outperforming existing open-source unified models while remaining competitive with strong modality-specific expert systems. These results demonstrate the potential of masked diffusion as a unified paradigm for any-to-any modeling, providing a flexible foundation for real-time omnimodal systems, unified cross-modal retrieval and generation, and embodied multimodal agents.

Jaeik Kim, Woojin Kim, Jihwan Hong, Yejoon Lee, Sieun Hyeon, Mintaek Lim, Yunseok Han, Dogeun Kim, Hoeun Lee, Hyunggeun Kim, Jaeyoung Do • 2026
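
The abstract describes generation as masked diffusion over a discrete token space with iterative refinement under bidirectional context. The following is a minimal toy sketch of that general idea, not the authors' implementation: `MASK`, `toy_denoiser`, `masked_diffusion_decode`, the confidence-based reveal rule, and the equal-share unmasking schedule are all assumptions made for illustration. The stand-in denoiser simply reads the answer from a known target and assigns random confidences, where a real model would predict every masked position from the full, partially masked sequence.

```python
import math
import random

MASK = -1  # hypothetical id for the mask token


def toy_denoiser(tokens, target, rng):
    """Stand-in for a trained bidirectional network (assumption).

    Returns a (predicted_token, confidence) pair for every position.
    A real model would compute these from `tokens`; this toy reads the
    prediction from `target` and draws a random confidence.
    """
    return [(target[i], rng.random()) for i in range(len(tokens))]


def masked_diffusion_decode(target, steps=4, seed=0):
    """Iteratively unmask a fully masked sequence in at most `steps` rounds."""
    rng = random.Random(seed)
    tokens = [MASK] * len(target)
    history = [list(tokens)]  # snapshot of the sequence after each round
    for step in range(steps):
        masked = [i for i, t in enumerate(tokens) if t == MASK]
        if not masked:
            break
        preds = toy_denoiser(tokens, target, rng)
        # Reveal an equal share of the remaining masks each round, so the
        # final round reveals everything that is left.
        n_reveal = math.ceil(len(masked) / (steps - step))
        # Commit the most confident predictions; the rest stay masked.
        masked.sort(key=lambda i: preds[i][1], reverse=True)
        for i in masked[:n_reveal]:
            tokens[i] = preds[i][0]
        history.append(list(tokens))
    return tokens, history


if __name__ == "__main__":
    final, history = masked_diffusion_decode([5, 3, 9, 1, 7, 2, 8, 4], steps=4)
    for state in history:
        print(state)
```

Each printed line is one refinement round. Positions are filled in order of confidence rather than left to right, which is the main contrast with autoregressive decoding that the abstract draws.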

Related benchmarks

Task                             Dataset                   Result                    Rank
Object Hallucination Evaluation  POPE                      Accuracy 87.7             1455
Multimodal Understanding         MMBench                   Accuracy 74.7             637
Multimodal Understanding         MMMU                      Accuracy 51.6             437
Video Understanding              MVBench                   Accuracy 62               425
Video Question Answering         ActivityNet-QA            Accuracy 56.3             376
Science Question Answering       ARC Challenge             Accuracy 68.6             342
Multimodal Perception            MME Perception            Perception Score 1.73e+3  79
Video Question Answering         NextQA                    Accuracy 81.9             78
Temporal Video Understanding     TempCompass               --                        68
Text-to-Speech                   LibriSpeech clean (test)  WER 2.1                   66
Showing 10 of 19 rows
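
For readers unfamiliar with the WER figure reported for LibriSpeech: word error rate is conventionally the word-level edit distance (substitutions, deletions, insertions) between a reference transcript and a hypothesis, divided by the number of reference words. The sketch below shows that standard definition only. The function name `word_error_rate` is hypothetical, splitting on whitespace is a simplifying assumption, and the text normalization and transcription pipeline behind the reported number are not stated on this page. A score such as 2.1 is presumably this ratio expressed as a percentage.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = edit distance between the reference words processed so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words matches an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One deleted word out of six reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```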
