LLaDA-MoE: A Sparse MoE Diffusion Language Model

About

We introduce LLaDA-MoE, a large language diffusion model with the Mixture-of-Experts (MoE) architecture, trained from scratch on approximately 20T tokens. LLaDA-MoE achieves competitive performance with significantly reduced computational overhead by maintaining a 7B-parameter capacity while activating only 1.4B parameters during inference. Our empirical evaluation reveals that LLaDA-MoE achieves state-of-the-art performance among diffusion language models with larger parameter counts, surpassing the previous diffusion language models LLaDA, LLaDA 1.5, and Dream across multiple benchmarks. The instruct-tuned model LLaDA-MoE-7B-A1B-Instruct demonstrates capabilities comparable to Qwen2.5-3B-Instruct in knowledge understanding, code generation, mathematical reasoning, agent, and alignment tasks, despite using fewer active parameters. Our results show that integrating a sparse MoE architecture into the training objective of masked diffusion language models still brings out MoE's strengths under efficient inference with few active parameters, and opens ample room for further exploration of diffusion language models. LLaDA-MoE models are available on Hugging Face.

Fengqi Zhu, Zebin You, Yipeng Xing, Zenan Huang, Lin Liu, Yihong Zhuang, Guoshan Lu, Kangyu Wang, Xudong Wang, Lanning Wei, Hongrui Guo, Jiaqi Hu, Wentao Ye, Tieyuan Chen, Chenchen Li, Chengfu Tang, Haibo Feng, Jun Hu, Jun Zhou, Xiaolu Zhang, Zhenzhong Lan, Junbo Zhao, Da Zheng, Chongxuan Li, Jianguo Li, Ji-Rong Wen • 2025
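
The sparse-activation idea in the abstract (7B parameters in total, about 1.4B active at inference) can be illustrated with a toy top-k router. This is a minimal sketch in plain Python, not the LLaDA-MoE implementation: the number of experts, the value of k, the scalar "experts", and every function name here are assumptions made for illustration, since the summary above does not state the routing details.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(router_logits, k):
    """Return (expert_index, weight) pairs for the k highest-scoring experts.

    Weights are renormalized so that the selected experts sum to 1.
    (Hypothetical routing rule; the source does not specify one.)
    """
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]


def moe_forward(x, experts, router_logits, k):
    """Evaluate only the k selected experts and mix their outputs."""
    out = 0.0
    for idx, weight in route_top_k(router_logits, k):
        out += weight * experts[idx](x)
    return out


def active_fraction(total_params, active_params):
    """Share of parameters that take part in one forward pass."""
    return active_params / total_params


if __name__ == "__main__":
    # Four toy "experts", each just scaling its input (assumption).
    experts = [lambda x, s=s: s * x for s in (1.0, 2.0, 3.0, 4.0)]
    logits = [0.1, 2.0, -1.0, 1.5]  # made-up router scores for one token
    print(route_top_k(logits, k=2))
    print(moe_forward(1.0, experts, logits, k=2))
    # Figures quoted in the abstract: 7B total, 1.4B active.
    print(active_fraction(7e9, 1.4e9))
```

Only the selected experts are evaluated for a given token, which is why the per-token compute tracks the active parameter count rather than the total. With the abstract's figures, roughly one fifth of the parameters are active per forward pass.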

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Code Generation | HumanEval | Pass@1: 61.6 | 850 |
| Language Understanding | MMLU | Accuracy: 67.2 | 756 |
| Mathematical Reasoning | MATH | Accuracy: 58.7 | 643 |
| Mathematical Reasoning | MATH | Accuracy: 36.1 | 535 |
| Code Generation | HumanEval (test) | -- | 444 |
| Mathematical Reasoning | GSM8K | Accuracy (GSM8K): 82.4 | 358 |
| Instruction Following | IFEval | Accuracy (0-100): 59.3 | 292 |
| Code Generation | MBPP (test) | -- | 276 |
| Code Generation | MBPP | Pass@1: 70 | 175 |
| Code Generation | MBPP | Accuracy (%): 52.4 | 146 |
Showing 10 of 33 rows
