
LLaDA-o: An Effective and Length-Adaptive Omni Diffusion Model

About

We present LLaDA-o, an effective and length-adaptive omni diffusion model for multimodal understanding and generation. LLaDA-o is built on a Mixture of Diffusion (MoD) framework that decouples discrete masked diffusion for text understanding from continuous diffusion for visual generation, while coupling them through a shared, simple, and efficient attention backbone that reduces redundant computation for fixed conditions. Building on MoD, we further introduce a data-centric length-adaptation strategy that enables flexible-length decoding in multimodal settings without architectural changes. Extensive experiments show that LLaDA-o achieves state-of-the-art performance among omni diffusion models on multimodal understanding and generation benchmarks, reaching 87.04 on DPG-Bench for text-to-image generation and supporting the effectiveness of unified omni diffusion modeling. Code is available at https://github.com/ML-GSAI/LLaDA-o.
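
The page does not detail the MoD architecture beyond the summary above, so the following is a minimal, hypothetical PyTorch sketch of the general idea: a discrete masked-diffusion head for text and a continuous-diffusion head for image latents, joined by one shared attention trunk. All class names, dimensions, the mask ratio, and the fixed noise level are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class SharedBackbone(nn.Module):
    """Single transformer trunk shared by both diffusion branches."""
    def __init__(self, dim=256, heads=4, layers=2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=layers)

    def forward(self, x):
        return self.encoder(x)

class MoDSketch(nn.Module):
    """Hypothetical Mixture-of-Diffusion sketch: a discrete masked-diffusion
    head for text and a continuous-diffusion head for image latents share
    one attention backbone (sizes are illustrative, not from the paper)."""
    def __init__(self, vocab_size=1000, dim=256, image_latent_dim=16):
        super().__init__()
        self.text_embed = nn.Embedding(vocab_size + 1, dim)  # +1 slot for [MASK]
        self.mask_id = vocab_size
        self.image_proj = nn.Linear(image_latent_dim, dim)
        self.backbone = SharedBackbone(dim)
        self.text_head = nn.Linear(dim, vocab_size)          # token logits at masked positions
        self.image_head = nn.Linear(dim, image_latent_dim)   # noise prediction on latents

    def forward(self, text_ids, image_latents):
        # Concatenate both modalities into one sequence so attention is shared.
        h = torch.cat([self.text_embed(text_ids), self.image_proj(image_latents)], dim=1)
        h = self.backbone(h)
        n_text = text_ids.shape[1]
        return self.text_head(h[:, :n_text]), self.image_head(h[:, n_text:])

model = MoDSketch()
text = torch.randint(0, 1000, (2, 8))
# Discrete masked diffusion: corrupt a random subset of text tokens with [MASK].
mask = torch.rand(text.shape) < 0.5
noisy_text = torch.where(mask, torch.full_like(text, model.mask_id), text)
# Continuous diffusion: add Gaussian noise to image latents (fixed level for brevity).
latents = torch.randn(2, 16, 16)  # (batch, patches, latent_dim)
noise = torch.randn_like(latents)
noisy_latents = latents + 0.5 * noise
text_logits, noise_pred = model(noisy_text, noisy_latents)
loss_text = nn.functional.cross_entropy(text_logits[mask], text[mask])  # recover masked tokens
loss_image = nn.functional.mse_loss(noise_pred, noise)                  # predict the added noise
```

In this sketch, each modality keeps its own corruption process and loss while attention over the joint sequence provides the coupling; how LLaDA-o caches fixed conditions to avoid redundant computation is not specified on this page.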

Zebin You, Xiaolu Zhang, Jun Zhou, Chongxuan Li, Ji-Rong Wen • 2026

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Text-to-Image Generation | GenEval | Overall Score | 86 | 506 |
| Text-to-Image Generation | DPG-Bench | Overall Score | 87.04 | 265 |
| Diagram Understanding | AI2D | Accuracy | 79.3 | 247 |
| Multimodal Understanding | MME | -- | -- | 207 |
| Multi-discipline Multimodal Understanding | MMMU (val) | Accuracy | 44.9 | 204 |
| Document Visual Question Answering | DocVQA (val) | Accuracy | 91.5 | 157 |
| Chart Understanding | ChartQA | Accuracy | 87.9 | 127 |
| Mathematical Reasoning | MathVista (testmini) | Accuracy | 66.1 | 103 |
| Visual Question Answering | InfoVQA (val) | Accuracy | 54.7 | 91 |
| Image Understanding | SEED-Bench Image | Accuracy | 75.3 | 27 |
Showing 10 of 12 benchmark rows.

Other info

GitHub: https://github.com/ML-GSAI/LLaDA-o
