LLaDA-o: An Effective and Length-Adaptive Omni Diffusion Model
About
We present **LLaDA-o**, an effective and length-adaptive omni diffusion model for multimodal understanding and generation. LLaDA-o is built on a Mixture of Diffusion (MoD) framework that decouples discrete masked diffusion for text understanding from continuous diffusion for visual generation, while coupling the two through a shared, simple, and efficient attention backbone that avoids redundant computation over fixed conditioning inputs. Building on MoD, we further introduce a data-centric length adaptation strategy that enables flexible-length decoding in multimodal settings without architectural changes. Extensive experiments show that LLaDA-o achieves state-of-the-art performance among omni-diffusion models on multimodal understanding and generation benchmarks, and reaches 87.04 on DPG-Bench for text-to-image generation, demonstrating the effectiveness of unified omni diffusion modeling. Code is available at https://github.com/ML-GSAI/LLaDA-o.
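For readers who want a concrete picture of the MoD layout, the sketch below shows how a single shared attention backbone can serve a discrete masked-diffusion head for text tokens and a continuous-diffusion head for image latents. This is a minimal PyTorch illustration of the idea, not the released implementation; the `MoDSketch` class, module names, and all dimensions are assumptions for exposition.

```python
import torch
import torch.nn as nn

class MoDSketch(nn.Module):
    """Toy Mixture-of-Diffusion layout: one shared backbone, two diffusion heads."""
    def __init__(self, vocab_size=32000, dim=512, latent_dim=16, depth=4, heads=8):
        super().__init__()
        self.text_embed = nn.Embedding(vocab_size + 1, dim)   # +1 slot for [MASK]
        self.latent_proj = nn.Linear(latent_dim, dim)         # image latents -> backbone width
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=depth)  # shared attention
        self.text_head = nn.Linear(dim, vocab_size)    # discrete masked-diffusion logits
        self.image_head = nn.Linear(dim, latent_dim)   # continuous-diffusion prediction

    def forward(self, text_ids, image_latents):
        # Text tokens (some replaced by [MASK]) and noised image latents share a
        # single attention pass over the concatenated sequence.
        h = torch.cat([self.text_embed(text_ids), self.latent_proj(image_latents)], dim=1)
        h = self.backbone(h)
        n = text_ids.shape[1]
        return self.text_head(h[:, :n]), self.image_head(h[:, n:])

model = MoDSketch()
logits, noise_pred = model(torch.randint(0, 32000, (2, 16)), torch.randn(2, 64, 16))
```

Because both modalities flow through one backbone, fixed conditioning tokens (e.g., the prompt) can in principle be encoded once and reused across denoising steps, which matches the redundancy reduction described above.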
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Text-to-Image Generation | GenEval | Overall Score | 86 | 506 |
| Text-to-Image Generation | DPG-Bench | Overall Score | 87.04 | 265 |
| Diagram Understanding | AI2D | Accuracy | 79.3 | 247 |
| Multimodal Understanding | MME | -- | -- | 207 |
| Multi-discipline Multimodal Understanding | MMMU (val) | Accuracy | 44.9 | 204 |
| Document Visual Question Answering | DocVQA (val) | Accuracy | 91.5 | 157 |
| Chart Understanding | ChartQA | Accuracy | 87.9 | 127 |
| Mathematical Reasoning | MathVista (testmini) | Accuracy | 66.1 | 103 |
| Visual Question Answering | InfoVQA (val) | Accuracy | 54.7 | 91 |
| Image Understanding | SEED-Bench Image | Accuracy | 75.3 | 27 |
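As a companion to the length-adaptation claim in the About section: since the strategy is data-centric rather than architectural, flexible-length decoding can be pictured as ordinary confidence-based unmasking over a maximum-length canvas, truncated at a learned end-of-text token. The sketch below is a hypothetical illustration under that assumption; the `decode` function, token ids, and unmasking schedule are not LLaDA-o's actual decoder.

```python
import math
import torch

MASK_ID, EOT_ID, MAX_LEN, STEPS = 0, 1, 64, 8

@torch.no_grad()
def decode(model, prompt_ids):
    # Start from a fully masked canvas of the maximum length.
    x = torch.full((1, MAX_LEN), MASK_ID, dtype=torch.long)
    x[0, : prompt_ids.numel()] = prompt_ids
    for step in range(STEPS):
        masked = x == MASK_ID
        if not masked.any():
            break
        conf, pred = model(x).softmax(-1).max(-1)  # per-position confidence and argmax
        conf[~masked] = -1.0                       # only reveal still-masked slots
        k = math.ceil(masked.sum().item() / (STEPS - step))
        idx = conf[0].topk(k).indices              # unmask the most confident positions
        x[0, idx] = pred[0, idx]
    # Truncate at the first end-of-text token: the effective length is decided
    # by what the model generates, not by the architecture.
    out = x[0].tolist()
    return out[: out.index(EOT_ID)] if EOT_ID in out else out
```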