LLaDA2.0-Uni: Unifying Multimodal Understanding and Generation with Diffusion Large Language Model
About
We present LLaDA2.0-Uni, a unified discrete diffusion large language model (dLLM) that supports multimodal understanding and generation within a natively integrated framework. Its architecture combines a fully semantic discrete tokenizer, a MoE-based dLLM backbone, and a diffusion decoder. By discretizing continuous visual inputs via SigLIP-VQ, the model enables block-level masked diffusion for both text and vision inputs within the backbone, while the decoder reconstructs visual tokens into high-fidelity images. Inference efficiency is enhanced beyond parallel decoding through prefix-aware optimizations in the backbone and few-step distillation in the decoder. Supported by carefully curated large-scale data and a tailored multi-stage training pipeline, LLaDA2.0-Uni matches specialized VLMs in multimodal understanding while delivering strong performance in image generation and editing. Its native support for interleaved generation and reasoning establishes a promising and scalable paradigm for next-generation unified foundation models. Code and models are available at https://github.com/inclusionAI/LLaDA2.0-Uni.
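To make "block-level masked diffusion" more concrete, the following is a minimal toy sketch and not the authors' implementation. It assumes a generic scheme in which the output is filled block by block, and within each block all masked positions are predicted in parallel while only the most confident predictions are committed at each step. The names `MASK`, `toy_predict`, and `block_masked_decode`, as well as the confidence-based commit schedule, are hypothetical and chosen for illustration; the random predictor stands in for the dLLM backbone.

```python
import random

MASK = -1  # hypothetical id for the mask token


def toy_predict(seq, pos, rng):
    """Stand-in for the backbone: returns (token_id, confidence) for one masked slot."""
    return rng.randrange(100), rng.random()


def block_masked_decode(prompt, gen_len, block_size, steps_per_block, seed=0):
    """Fill gen_len masked slots after the prompt, one block at a time."""
    rng = random.Random(seed)
    seq = list(prompt) + [MASK] * gen_len
    for start in range(len(prompt), len(seq), block_size):
        block = range(start, min(start + block_size, len(seq)))
        for step in range(steps_per_block):
            masked = [i for i in block if seq[i] == MASK]
            if not masked:
                break
            # Predict every masked position in the block in parallel.
            preds = {i: toy_predict(seq, i, rng) for i in masked}
            # Commit only the most confident predictions; spread the rest
            # evenly over the remaining steps (ceiling division).
            remaining_steps = steps_per_block - step
            k = -(-len(masked) // remaining_steps)
            ranked = sorted(masked, key=lambda i: preds[i][1], reverse=True)
            for i in ranked[:k]:
                seq[i] = preds[i][0]
    return seq


tokens = block_masked_decode([1, 2, 3], gen_len=10, block_size=4, steps_per_block=3)
print(tokens)
```

In this sketch, earlier blocks are fully committed before later ones are touched, so the prompt and completed blocks form a fixed prefix for the block being decoded. That fixed prefix is the kind of structure prefix-aware optimizations could plausibly reuse, though the abstract does not describe the actual mechanism.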
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Text-to-Image Generation | GenEval | Overall Score: 89 | 704 |
| Optical Character Recognition | OCRBench | Score: 75.7 | 433 |
| Multimodal Understanding | MMStar | -- | 407 |
| Document Visual Question Answering | DocVQA | ANLS: 89.5 | 301 |
| Text-to-Image Generation | DPG | Overall Score: 87.76 | 256 |
| Multimodal Understanding | MMBench CN | -- | 254 |
| Mathematical Reasoning | WeMath | -- | 225 |
| Multimodal Reasoning | MMMU (val) | Accuracy: 50.1 | 168 |
| Visual Question Answering | SimpleVQA | Accuracy: 0.44 | 164 |
| Infographic Question Answering | InfoVQA | ANLS: 70.1 | 117 |