
MMaDA: Multimodal Large Diffusion Language Models

About

We introduce MMaDA, a novel class of multimodal diffusion foundation models designed to achieve superior performance across diverse domains such as textual reasoning, multimodal understanding, and text-to-image generation. The approach is distinguished by three key innovations: (i) MMaDA adopts a unified diffusion architecture with a shared probabilistic formulation and a modality-agnostic design, eliminating the need for modality-specific components. This architecture ensures seamless integration and processing across different data types. (ii) We implement a mixed long chain-of-thought (CoT) fine-tuning strategy that curates a unified CoT format across modalities. By aligning reasoning processes between textual and visual domains, this strategy facilitates cold-start training for the final reinforcement learning (RL) stage, thereby enhancing the model's ability to handle complex tasks from the outset. (iii) We propose UniGRPO, a unified policy-gradient-based RL algorithm specifically tailored for diffusion foundation models. Utilizing diversified reward modeling, UniGRPO unifies post-training across both reasoning and generation tasks, ensuring consistent performance improvements. Experimental results demonstrate that MMaDA-8B exhibits strong generalization capabilities as a unified multimodal foundation model. It surpasses powerful models like LLaMA-3-7B and Qwen2-7B in textual reasoning, outperforms Show-o and SEED-X in multimodal understanding, and excels over SDXL and Janus in text-to-image generation. These achievements highlight MMaDA's effectiveness in bridging the gap between pretraining and post-training within unified diffusion architectures, providing a comprehensive framework for future research and development. We open-source our code and trained models at: https://github.com/Gen-Verse/MMaDA
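To make the "unified diffusion architecture" concrete: discrete diffusion language models generate text by starting from a fully masked sequence and iteratively committing the model's most confident token predictions over a fixed number of denoising steps. The sketch below illustrates that decoding loop only; it is a toy with a random stand-in predictor, and all names (`toy_denoiser`, `diffusion_decode`, the mask sentinel) are illustrative assumptions, not the released MMaDA API.

```python
import random

MASK = -1                 # sentinel for a masked position (illustrative)
VOCAB = list(range(10))   # toy vocabulary

def toy_denoiser(tokens):
    """Stand-in for the diffusion model: returns a (token, confidence)
    guess for every masked position. A real model would run a
    transformer forward pass here instead of sampling randomly."""
    return {i: (random.choice(VOCAB), random.random())
            for i, t in enumerate(tokens) if t == MASK}

def diffusion_decode(length, steps):
    """Start fully masked; each denoising step commits the
    highest-confidence guesses for a fraction of the masked positions."""
    tokens = [MASK] * length
    for step in range(steps, 0, -1):
        guesses = toy_denoiser(tokens)
        if not guesses:
            break
        # unmask roughly 1/step of the remaining masked positions, so the
        # final step (step == 1) always commits everything that is left
        k = max(1, len(guesses) // step)
        best = sorted(guesses.items(), key=lambda kv: -kv[1][1])[:k]
        for pos, (tok, _conf) in best:
            tokens[pos] = tok
    return tokens

random.seed(0)
out = diffusion_decode(length=8, steps=4)
```

Because every modality is expressed as discrete tokens under the same masking objective, one loop of this shape can serve text and image tokens alike, which is what lets the paper drop modality-specific components.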

Ling Yang, Ye Tian, Bowen Li, Xinchen Zhang, Ke Shen, Yunhai Tong, Mengdi Wang • 2025

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Text-to-Image Generation | GenEval | Overall Score | 63 | 467 |
| Mathematical Reasoning | GSM8K | Accuracy | 73.4 | 351 |
| Visual Question Answering | ChartQA | Accuracy | 9.8 | 239 |
| Visual Mathematical Reasoning | MathVista | Accuracy | 33.7 | 189 |
| Visual Question Answering | AI2D | Accuracy | 66.6 | 174 |
| Text-to-Image Generation | DPG | Overall Score | 69.97 | 131 |
| Mathematical Reasoning | MATH 500 | Accuracy | 36 | 73 |
| Visual Mathematical Reasoning | MathVerse | Accuracy | 13.5 | 73 |
| Multimodal Understanding | MMBench en (dev) | Score | 68.5 | 38 |
| Image Generation | GenEval | Overall Score | 63 | 26 |

Showing 10 of 15 rows.
