Dynin-Omni: Omnimodal Unified Large Diffusion Language Model

About

We present Dynin-Omni, the first masked-diffusion-based omnimodal foundation model that unifies text, image, and speech understanding and generation, together with video understanding, within a single architecture. Unlike autoregressive unified models that serialize heterogeneous modalities, or compositional unified models that require orchestration with external modality-specific decoders, Dynin-Omni natively formulates omnimodal modeling as masked diffusion over a shared discrete token space, enabling iterative refinement under bidirectional context. Dynin-Omni adopts a multi-stage training strategy with model-merging-based modality expansion and omnimodal alignment. We evaluate Dynin-Omni across 19 multimodal benchmarks spanning language reasoning, image generation and editing, video understanding, and speech recognition and synthesis. Dynin-Omni achieves 87.6 on GSM8K, 1733.6 on MME-P, 61.4 on VideoMME, 0.87 on GenEval, and 2.1 WER on LibriSpeech test-clean, consistently outperforming existing open-source unified models while remaining competitive with strong modality-specific expert systems. These results demonstrate the potential of masked diffusion as a unified paradigm for any-to-any modeling, providing a flexible foundation for real-time omnimodal systems, unified cross-modal retrieval and generation, and embodied multimodal agents.
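To make "iterative refinement under bidirectional context" concrete, the following is a minimal sketch of confidence-based iterative unmasking, a generic decoding loop used by masked discrete diffusion models. It is not Dynin-Omni's actual sampler: the `MASK` id, the `toy_denoiser` stand-in (which returns random guesses instead of running a network), and the cosine schedule are all assumptions chosen for illustration.

```python
import math
import random

MASK = -1  # hypothetical id for the [MASK] token (assumption, not from the paper)


def toy_denoiser(tokens, vocab_size, rng):
    """Stand-in for a bidirectional network: one (predicted_id, confidence) per position.

    A real model would condition on every unmasked token; this toy draws random guesses.
    """
    return [(rng.randrange(vocab_size), rng.random()) for _ in tokens]


def iterative_unmask(length, vocab_size, steps, seed=0):
    """Start fully masked and commit the most confident predictions at each step."""
    rng = random.Random(seed)
    tokens = [MASK] * length
    for step in range(steps):
        masked = [i for i, t in enumerate(tokens) if t == MASK]
        if not masked:
            break
        preds = toy_denoiser(tokens, vocab_size, rng)
        # Cosine schedule (assumed): how many positions should stay masked after this step.
        keep_masked = int(length * math.cos(math.pi / 2 * (step + 1) / steps))
        n_commit = max(1, len(masked) - keep_masked)
        # Commit the highest-confidence masked positions; the rest stay masked.
        masked.sort(key=lambda i: preds[i][1], reverse=True)
        for i in masked[:n_commit]:
            tokens[i] = preds[i][0]
    return tokens
```

Calling `iterative_unmask(length=16, vocab_size=100, steps=4)` returns 16 token ids with no masked positions left. In a unified model the same loop could in principle run over text, image, or speech tokens, since they share one discrete vocabulary.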

Jaeik Kim, Woojin Kim, Jihwan Hong, Yejoon Lee, Sieun Hyeon, Mintaek Lim, Yunseok Han, Dogeun Kim, Hoeun Lee, Hyunggeun Kim, Jaeyoung Do • 2026

Related benchmarks

Task	Dataset	Metric	Result	(unlabeled in source)
Object Hallucination Evaluation	POPE	Accuracy	87.7	2056
Multimodal Understanding	MMBench	Accuracy	74.7	887
Video Understanding	MVBench	Accuracy	62	635
Video Question Answering	ActivityNet-QA	Accuracy	56.3	438
Multimodal Understanding	MMMU	Accuracy	51.6	437
Science Question Answering	ARC Challenge	Accuracy	68.6	354
Temporal Video Understanding	TempCompass	--	--	160
Multimodal Perception	MME Perception	Perception Score	1733.6	99
Text-to-Speech	LibriSpeech clean (test)	WER	2.1	97
Video Question Answering	NextQA	Accuracy	81.9	92

Showing 10 of 19 rows
