
Large Language Diffusion Models

About

The capabilities of large language models (LLMs) are widely regarded as relying on autoregressive models (ARMs). We challenge this notion by introducing LLaDA, a diffusion model trained from scratch under the pre-training and supervised fine-tuning (SFT) paradigm. LLaDA employs a forward data masking process and a reverse generation process, parameterized by a Transformer that predicts masked tokens. It provides a principled generative approach to probabilistic inference by optimizing a likelihood lower bound. Across extensive benchmarks spanning general tasks, math, code, and other domains, LLaDA demonstrates strong scalability and performs comparably to our self-constructed ARM baselines. Remarkably, LLaDA 8B is competitive with strong LLMs like LLaMA3 8B in in-context learning and, after SFT, exhibits impressive instruction-following abilities in case studies such as multi-turn dialogue. Moreover, LLaDA addresses the reversal curse, surpassing GPT-4o in a reversal poem completion task. Our findings show the promise of diffusion models for language modeling at scale and challenge the common assumption that the core LLM capabilities discussed above inherently depend on ARMs. Project page and code: https://ml-gsai.github.io/LLaDA-demo/.
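The forward masking process described in the abstract can be sketched as follows. This is a simplified illustration under assumptions, not the authors' implementation: each token is masked independently with probability t (for t drawn uniformly from [0, 1] during training), and the masked cross-entropy is reweighted by 1/t to form the likelihood lower bound. The mask token id and function names here are hypothetical.

```python
import random

MASK = -1  # hypothetical mask token id


def forward_mask(tokens, t, rng=None):
    """Forward process: mask each token independently with probability t."""
    rng = rng or random.Random(0)  # fixed seed for a reproducible sketch
    return [MASK if rng.random() < t else tok for tok in tokens]


def loss_weight(t):
    """ELBO weighting: the masked-token cross-entropy is scaled by 1/t."""
    return 1.0 / t


# Toy example: at t = 0.5, roughly half the tokens are replaced by MASK;
# a Transformer would then be trained to predict the tokens at those positions.
seq = [5, 9, 2, 7, 3]
t = 0.5
noisy = forward_mask(seq, t)
masked_positions = [i for i, tok in enumerate(noisy) if tok == MASK]
```

At t = 1 every token is masked (the model must generate the whole sequence), while t near 0 leaves the sequence almost intact, which is what makes the objective a principled bound on the data likelihood.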

Shen Nie, Fengqi Zhu, Zebin You, Xiaolu Zhang, Jingyang Ou, Jun Hu, Jun Zhou, Yankai Lin, Ji-Rong Wen, Chongxuan Li • 2025

Related benchmarks

Task                          Dataset                    Result          Rank
Mathematical Reasoning        GSM8K                      Accuracy 78.2   983
Automatic Speech Recognition  LibriSpeech (test-other)   WER 5.22        966
Code Generation               HumanEval                  Pass@1 45.12    850
Automatic Speech Recognition  LibriSpeech clean (test)   WER 2.34        833
Mathematical Reasoning        GSM8K (test)               Accuracy 78.2   797
Language Understanding        MMLU                       Accuracy 65.9   756
Commonsense Reasoning         PIQA                       Accuracy 74.4   647
Mathematical Reasoning        MATH                       Accuracy 27.3   535
Reasoning                     BBH                        --              507
Code Generation               HumanEval (test)           --              444
Showing 10 of 184 rows
