SPG: Sandwiched Policy Gradient for Masked Diffusion Language Models

About

Diffusion large language models (dLLMs) are emerging as an efficient alternative to autoregressive models due to their ability to decode multiple tokens in parallel. However, aligning dLLMs with human preferences or task-specific rewards via reinforcement learning (RL) is challenging because their intractable log-likelihood precludes the direct application of standard policy gradient methods. While prior work uses surrogates like the evidence lower bound (ELBO), these one-sided approximations can introduce significant policy gradient bias. To address this, we propose the Sandwiched Policy Gradient (SPG), which leverages both an upper and a lower bound of the true log-likelihood. Experiments show that SPG significantly outperforms baselines based on ELBO or one-step estimation. Specifically, SPG improves accuracy over state-of-the-art RL methods for dLLMs by 3.6% on GSM8K, 2.6% on MATH500, 18.4% on Countdown, and 27.0% on Sudoku.
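To make the role of the two bounds concrete, here is a minimal sketch of one way they could be combined. It is an assumption for illustration, not something this page confirms about SPG: the function name `sandwiched_objective` and its arguments are hypothetical, and the per-sample bounds are taken as given numbers rather than computed from a model. The idea rests only on a simple inequality: if `lo <= log p(x) <= hi`, then `A * log p(x) >= A * lo` when the advantage `A` is non-negative and `A * log p(x) >= A * hi` when it is negative, so choosing the bound by the sign of the advantage yields a surrogate that never exceeds the true objective.

```python
from typing import Sequence


def sandwiched_objective(
    advantages: Sequence[float],
    lower_bounds: Sequence[float],
    upper_bounds: Sequence[float],
) -> float:
    """Surrogate for mean_i A_i * log p(x_i) built from two-sided bounds.

    Assumes (hypothetically) that for every sample i:
        lower_bounds[i] <= log p(x_i) <= upper_bounds[i]
    Under that assumption the returned value is <= the true objective.
    """
    n = len(advantages)
    if n == 0:
        raise ValueError("need at least one sample")
    if not (n == len(lower_bounds) == len(upper_bounds)):
        raise ValueError("all inputs must have the same length")

    total = 0.0
    for adv, lo, hi in zip(advantages, lower_bounds, upper_bounds):
        # Non-negative advantage: the lower bound keeps the surrogate conservative.
        # Negative advantage: the upper bound does, because the sign flips.
        total += adv * (lo if adv >= 0 else hi)
    return total / n


# Toy numbers (made up): one sample with positive and one with negative advantage.
advantages = [1.0, -1.0]
lower = [-3.0, -4.0]
upper = [-1.0, -2.0]
surrogate = sandwiched_objective(advantages, lower, upper)
```

With these toy numbers the surrogate evaluates to -0.5, and any true log-likelihoods lying between the bounds give an objective at least that large. A surrogate that used only the lower bound for both samples would not have this guarantee for the negative-advantage sample.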

Chenyu Wang, Paria Rashidinejad, DiJia Su, Song Jiang, Sid Wang, Siyan Zhao, Cai Zhou, Shannon Zejiang Shen, Feiyu Chen, Tommi Jaakkola, Yuandong Tian, Bo Liu • 2025

Related benchmarks

Task	Dataset	Result
Commonsense Reasoning	WinoGrande	Accuracy 84.3	1433
Mathematical Reasoning	GSM8K	--	1398
Code Generation	HumanEval	Pass@1 74.6	1043
Physical Commonsense Reasoning	PIQA	Accuracy 80.9	692
Code Generation	HumanEval+	Pass@1 69.1	393
Mathematical Reasoning	MATH	Accuracy 34.1	338
Mathematical Reasoning	GSM8K	Accuracy (Acc) 86.1	334
Science Question Answering	ARC-C	Accuracy 83.5	261
Mathematical Reasoning	Countdown	Accuracy 71.5	252
Science Reasoning	GPQA	Accuracy 25.9	243

Showing 10 of 43 rows
