d2: Improving Reasoning in Diffusion Language Models via Trajectory Likelihood Estimation

About

While diffusion language models (DLMs) have achieved competitive performance in text generation, improving their reasoning ability with reinforcement learning remains an active research area. Here, we introduce d2, a reasoning framework tailored for masked DLMs. Central to our framework is a new policy gradient algorithm that relies on accurate estimates of the sampling trajectory likelihoods. Because computing these likelihoods naively is computationally expensive for masked DLMs, we develop a family of estimators tailored to distinct model classes. For DLMs that support a sampling algorithm called any-order decoding, we propose d2-AnyOrder, which computes the exact trajectory likelihood with a single model pass. Through an empirical study of widely used DLMs, we show that any-order decoding is not universally supported in practice. For standard masked diffusion models, we propose d2-StepMerge, which approximates the trajectory likelihood, trading off compute for approximation accuracy in an analytically tractable manner. Empirically, d2 significantly outperforms widely used RL baselines when applied to popular DLMs, and sets a new state of the art for DLMs on logical reasoning tasks (Countdown and Sudoku) and math reasoning benchmarks (GSM8K and MATH500). We provide the code along with a blog post on the project page: https://guanghanwang.com/d2
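
The compute-versus-accuracy trade-off described above can be illustrated with a toy sketch. This is not the authors' implementation: `toy_model`, `MASK`, `VOCAB`, and `trajectory_logprob` are hypothetical names, and the segment-merging scheme is only one simple reading of the idea, assuming a sampling trajectory is recorded as the list of positions unmasked at each step. Scoring every step with its own model pass gives the per-step trajectory log-likelihood; grouping steps into fewer segments needs fewer passes but scores later tokens under a staler context.

```python
import math

MASK = -1   # hypothetical mask token id (assumption for this sketch)
VOCAB = 4   # toy vocabulary size

def toy_model(seq):
    """Stand-in denoiser: returns a distribution over VOCAB for each position.
    Scores depend on the visible tokens, so the conditioning context matters."""
    context = sum(tok + 1 for tok in seq if tok != MASK)
    probs = []
    for pos in range(len(seq)):
        scores = [math.sin(0.7 * v + 0.3 * pos + 0.5 * context) for v in range(VOCAB)]
        z = sum(math.exp(s) for s in scores)
        probs.append([math.exp(s) / z for s in scores])
    return probs

def trajectory_logprob(final_seq, steps, num_segments, model=toy_model):
    """steps[t] lists the positions unmasked at sampling step t.
    Steps are split into contiguous groups of equal size (at most num_segments
    groups); each group is scored with one model pass on the state at its start."""
    seq = [MASK] * len(final_seq)
    total, passes = 0.0, 0
    seg_len = math.ceil(len(steps) / num_segments)
    for start in range(0, len(steps), seg_len):
        probs = model(seq)          # one model pass per segment
        passes += 1
        for positions in steps[start:start + seg_len]:
            for pos in positions:
                total += math.log(probs[pos][final_seq[pos]])
                seq[pos] = final_seq[pos]   # reveal the token for later segments
    return total, passes

final_seq = [2, 0, 3, 1, 1, 2]
steps = [[0], [3], [1, 4], [2], [5]]   # five sampling steps

per_step, passes_per_step = trajectory_logprob(final_seq, steps, num_segments=len(steps))
merged, passes_merged = trajectory_logprob(final_seq, steps, num_segments=2)
print(passes_per_step, per_step)
print(passes_merged, merged)
```

With `num_segments` equal to the number of sampling steps, every step is scored under its true context at the cost of one pass per step; lowering `num_segments` reduces the number of passes and makes the estimate coarser.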

Guanghan Wang, Gilad Turok, Yair Schiff, Marianne Arriola, Volodymyr Kuleshov • 2025

Related benchmarks

Task	Dataset	Result
Code Generation	HumanEval (test)	--	701
Code Generation	MBPP (test)	--	411
Mathematical Reasoning	Countdown	Accuracy 56.6	252
Logical reasoning	Sudoku	Accuracy 91.9	152
Reasoning	Sudoku	Pass@1 76.1	60
Reasoning	Countdown	Accuracy 56.6	49
Constraint Satisfaction Reasoning	Countdown	Pass@1 52.4	34
Reasoning	Sudoku	Accuracy (Sudoku Reasoning) 91.9	25
Toxicity Steering	Eso-LM (512 sequences)	Toxicity Score -9.2	12
