
Simple Policy Gradients for Reasoning with Diffusion Language Models

About

Diffusion large language models (dLLMs), which offer a promising alternative to traditional autoregressive LLMs, have recently shown strong results in pretraining. However, because they lack tractable sequence-level likelihoods, they have yet to benefit from modern LLM post-training techniques such as reinforcement learning (RL), limiting their real-world applicability. Existing attempts at dLLM post-training rely on heuristic approximations or lower bounds of the true likelihood. In this work, we propose Amortized Group Relative Policy Optimization (AGRPO), a policy gradient algorithm that leverages the multi-step Markovian nature of dLLM generation, optimizing individual denoising steps rather than entire sequences. We demonstrate AGRPO's effectiveness on a range of math and reasoning tasks, achieving +9.9% absolute gain on GSM8K, +4.6% on MATH-500, +59.4% on Countdown, and +69.7% on Sudoku over the base LLaDA model, improving upon comparable dLLM RL methods such as diffu-GRPO. Furthermore, we analyze how post-training gains persist across different inference configurations, revealing that models trained with AGRPO can sample 4x faster with minimal performance sacrifices.
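To make the core idea concrete, here is a minimal sketch of the GRPO-style group-relative advantage and a per-denoising-step policy gradient term. This is an illustration only, not the paper's exact algorithm: the function names and the broadcasting of a sequence-level advantage across denoising steps are assumptions for exposition.

```python
import math

def group_relative_advantages(rewards):
    # GRPO-style normalization: score each sampled completion's reward
    # against the mean/std of its group (completions for the same prompt).
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    return [(r - mean) / (std + 1e-6) for r in rewards]

def per_step_pg_loss(step_log_probs, advantage):
    # Illustrative per-denoising-step policy-gradient term: because dLLM
    # generation is a multi-step Markov process, the sequence-level
    # advantage can reinforce each denoising transition's log-prob
    # individually, rather than an intractable sequence likelihood.
    return -sum(advantage * lp for lp in step_log_probs) / len(step_log_probs)

# Toy usage: 4 sampled completions, rewarded 1.0 if correct, 0.0 otherwise.
advs = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
loss = per_step_pg_loss([-1.0, -2.0], advs[0])
```

Here a positive advantage on a correct sample pushes the loss to increase that sample's per-step log-probabilities, which is the shape of objective the abstract describes.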

Anthony Zhan • 2025

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Commonsense Reasoning | WinoGrande | Accuracy 87.3 | 1085 |
| Code Generation | HumanEval | Pass@1 75.3 | 1036 |
| Physical Commonsense Reasoning | PIQA | Accuracy 85.6 | 572 |
| Code Generation | HumanEval+ | Pass@1 70.2 | 383 |
| Mathematical Reasoning | MATH | Accuracy 36.1 | 338 |
| Science Reasoning | GPQA | Accuracy 27.3 | 243 |
| Code Generation | MBPP+ | Pass@1 71.7 | 216 |
| Commonsense Reasoning | HellaSwag | Accuracy 85.1 | 213 |
| Code Generation | MBPP | Pass@1 80.3 | 159 |
| Code Generation | EvalPlus | Pass@1 69.3 | 61 |

Showing 10 of 12 rows.
