Scaling up Masked Diffusion Models on Text

About

Masked diffusion models (MDMs) have shown promise in language modeling, yet their scalability and effectiveness in core language tasks, such as text generation and language understanding, remain underexplored. This paper establishes the first scaling law for MDMs, demonstrating a scaling rate comparable to that of autoregressive models (ARMs) and a relatively small compute gap. Motivated by this scalability, we train a family of MDMs with up to 1.1 billion (B) parameters to systematically evaluate their performance against ARMs of comparable or larger sizes. Leveraging the probabilistic formulation of MDMs, we propose a simple yet effective unsupervised classifier-free guidance that exploits large-scale unpaired data, boosting performance for conditional inference. In language understanding, the 1.1B MDM outperforms the 1.1B TinyLlama model trained on the same data on four of eight zero-shot benchmarks. Notably, its math reasoning ability on the GSM8K dataset is competitive with that of the 7B Llama-2 model. In text generation, MDMs with 16 times more pre-training time offer a flexible trade-off against ARMs that use the accelerated sampling technique KV-Cache: MDMs match ARMs in performance while being 1.4 times faster during sampling. Moreover, MDMs address tasks that are challenging for ARMs by effectively handling bidirectional reasoning and adapting to temporal shifts in data. Notably, a 1.1B MDM breaks the reverse curse encountered by much larger ARMs trained with significantly more data and computation, such as 13B Llama-2 and 175B GPT-3. Our code is available at https://github.com/ML-GSAI/SMDM.
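
The abstract mentions classifier-free guidance without spelling out how guidance combines predictions. The sketch below shows the generic classifier-free guidance rule in log-probability space, where a conditional prediction is pushed away from an unconditional one. It is not taken from the paper or its repository: the function names, the guidance scale `w`, and the toy logits are all hypothetical, and how the paper obtains the unconditional prediction from unpaired data is not described here.

```python
import math

def log_softmax(logits):
    # Numerically stable log-softmax over a list of floats.
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def guided_log_probs(cond_logits, uncond_logits, w):
    # Generic classifier-free guidance (assumed form, not the authors' code):
    #   log p_guided ∝ (1 + w) * log p_cond - w * log p_uncond
    # w = 0 recovers the conditional distribution; larger w sharpens it
    # toward tokens the condition makes more likely.
    cond = log_softmax(cond_logits)
    uncond = log_softmax(uncond_logits)
    mixed = [(1.0 + w) * c - w * u for c, u in zip(cond, uncond)]
    return log_softmax(mixed)  # renormalize to a valid distribution

# Toy three-token vocabulary (hypothetical numbers).
cond_logits = [2.0, 1.0, 0.0]    # prediction given the prompt
uncond_logits = [1.0, 1.0, 1.0]  # prediction without the prompt
plain = guided_log_probs(cond_logits, uncond_logits, 0.0)
guided = guided_log_probs(cond_logits, uncond_logits, 1.0)
```

With these toy inputs, the guided distribution puts more mass on the first token than the plain conditional one does, which is the qualitative effect guidance is meant to have.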

Shen Nie, Fengqi Zhu, Chao Du, Tianyu Pang, Qian Liu, Guangtao Zeng, Min Lin, Chongxuan Li • 2024

Related benchmarks

Task	Dataset	Result
Code Generation	HumanEval	Pass@1 13.2	1043
Commonsense Reasoning	PIQA	Accuracy 60.3	757
Question Answering	ARC-E	Accuracy 37.4	523
Code Generation	HumanEval+	Pass@1 30.5	393
Question Answering	OBQA	Accuracy 27	347
Question Answering	BoolQ	Accuracy 61.5	317
Common Sense Reasoning	BoolQ	Accuracy 62.17	240
Code Generation	MBPP+	Pass@1 52.6	238
Commonsense Reasoning	OBQA	Accuracy 33.4	187
Commonsense Reasoning	SIQA	Accuracy 37.9	168

Showing 10 of 25 rows
