
MMaDA: Multimodal Large Diffusion Language Models

About

We introduce MMaDA, a novel class of multimodal diffusion foundation models designed to achieve superior performance across diverse domains such as textual reasoning, multimodal understanding, and text-to-image generation. The approach is distinguished by three key innovations: (i) MMaDA adopts a unified diffusion architecture with a shared probabilistic formulation and a modality-agnostic design, eliminating the need for modality-specific components. This architecture ensures seamless integration and processing across different data types. (ii) We implement a mixed long chain-of-thought (CoT) fine-tuning strategy that curates a unified CoT format across modalities. By aligning reasoning processes between textual and visual domains, this strategy facilitates cold-start training for the final reinforcement learning (RL) stage, thereby enhancing the model's ability to handle complex tasks from the outset. (iii) We propose UniGRPO, a unified policy-gradient-based RL algorithm specifically tailored for diffusion foundation models. Utilizing diversified reward modeling, UniGRPO unifies post-training across both reasoning and generation tasks, ensuring consistent performance improvements. Experimental results demonstrate that MMaDA-8B exhibits strong generalization capabilities as a unified multimodal foundation model. It surpasses powerful models like LLaMA-3-7B and Qwen2-7B in textual reasoning, outperforms Show-o and SEED-X in multimodal understanding, and excels over SDXL and Janus in text-to-image generation. These achievements highlight MMaDA's effectiveness in bridging the gap between pretraining and post-training within unified diffusion architectures, providing a comprehensive framework for future research and development. We open-source our code and trained models at: https://github.com/Gen-Verse/MMaDA
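At a high level, the unified diffusion formulation in (i) corresponds to masked discrete diffusion over a shared token space: generation starts from a fully masked sequence and iteratively unmasks positions across denoising steps, regardless of whether the tokens encode text or image patches. The sketch below is an illustration only — the toy denoiser and the random re-masking schedule are our assumptions, not MMaDA's actual implementation (real models typically finalize the most confident predictions at each step):

```python
import random

MASK = "<mask>"

def toy_denoiser(tokens):
    # Stand-in for the trained network: in a unified model, the same
    # denoiser handles text tokens and discrete image tokens alike.
    # Here it simply returns a placeholder prediction per position.
    return ["tok" for _ in tokens]

def diffusion_sample(length, steps=4, denoiser=toy_denoiser, seed=0):
    """Iterative unmasking loop used by masked discrete diffusion models.

    Start from an all-masked sequence; at each step, let the denoiser
    predict every masked position, then finalize a growing fraction of
    predictions and leave the rest masked for the next step.
    """
    rng = random.Random(seed)
    seq = [MASK] * length
    for step in range(1, steps + 1):
        preds = denoiser(seq)
        keep_ratio = step / steps  # fraction of positions to finalize
        for i in range(length):
            if seq[i] == MASK:
                # Final step unmasks everything that remains.
                if rng.random() < keep_ratio or step == steps:
                    seq[i] = preds[i]
    return seq

print(diffusion_sample(8))  # no <mask> tokens remain after the last step
```

Because the sampler never branches on modality, the same loop serves both understanding-style decoding and text-to-image generation, which is the sense in which the architecture is modality-agnostic.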

Ling Yang, Ye Tian, Bowen Li, Xinchen Zhang, Ke Shen, Yunhai Tong, Mengdi Wang • 2025

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Object Hallucination Evaluation | POPE | Accuracy: 86.1 | 1455 |
| Mathematical Reasoning | GSM8K | -- | 1362 |
| Commonsense Reasoning | WinoGrande | Accuracy: 84.7 | 1085 |
| Code Generation | HumanEval | Pass@1: 74.5 | 1036 |
| Multimodal Understanding | MMBench | Accuracy: 68.5 | 637 |
| Physical Commonsense Reasoning | PIQA | Accuracy: 81.2 | 572 |
| Text-to-Image Generation | GenEval | Overall Score: 63 | 506 |
| Mathematical Reasoning | GSM8K | Accuracy: 73.4 | 499 |
| Multimodal Understanding | MMMU | Accuracy: 30.2 | 437 |
| Text-to-Image Generation | GenEval | Overall Score: 63 | 391 |

Showing 10 of 86 rows.
