
Simple Policy Gradients for Reasoning with Diffusion Language Models

About

Diffusion large language models (dLLMs), which offer a promising alternative to traditional autoregressive LLMs, have recently shown strong results in pretraining. However, because they lack tractable sequence-level likelihoods, they have yet to benefit from modern LLM post-training techniques such as reinforcement learning (RL), limiting their real-world applicability. Existing attempts at dLLM post-training rely on heuristic approximations or lower bounds of the true likelihood. In this work, we propose Amortized Group Relative Policy Optimization (AGRPO), a policy gradient algorithm that leverages the multi-step Markovian nature of dLLM generation, optimizing individual denoising steps rather than entire sequences. We demonstrate AGRPO's effectiveness on a range of math and reasoning tasks, achieving +9.9% absolute gain on GSM8K, +4.6% on MATH-500, +59.4% on Countdown, and +69.7% on Sudoku over the base LLaDA model, improving upon comparable dLLM RL methods such as diffu-GRPO. Furthermore, we analyze how post-training gains persist across different inference configurations, revealing that models trained with AGRPO can sample 4x faster with minimal performance sacrifices.
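To make the core idea concrete, here is a minimal sketch of the GRPO-style group-relative advantage and a per-denoising-step policy gradient term. This is an illustration only, not the paper's exact algorithm: the function names and the broadcasting of a sequence-level advantage across denoising steps are assumptions for exposition.

```python
import math

def group_relative_advantages(rewards):
    # GRPO-style normalization: score each sampled completion's reward
    # against the mean/std of its group (completions for the same prompt).
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    return [(r - mean) / (std + 1e-6) for r in rewards]

def per_step_pg_loss(step_log_probs, advantage):
    # Illustrative per-denoising-step policy-gradient term: because dLLM
    # generation is a multi-step Markov process, the sequence-level
    # advantage can reinforce each denoising transition's log-prob
    # individually, rather than an intractable sequence likelihood.
    return -sum(advantage * lp for lp in step_log_probs) / len(step_log_probs)

# Toy usage: 4 sampled completions, rewarded 1.0 if correct, 0.0 otherwise.
advs = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
loss = per_step_pg_loss([-1.0, -2.0], advs[0])
```

Here a positive advantage on a correct sample pushes the loss to increase that sample's per-step log-probabilities, which is the shape of objective the abstract describes.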

Anthony Zhan • 2025

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Commonsense Reasoning | WinoGrande | Accuracy 87.3 | 1085 |
| Code Generation | HumanEval | Pass@1 75.3 | 1036 |
| Physical Commonsense Reasoning | PIQA | Accuracy 85.6 | 572 |
| Code Generation | HumanEval+ | Pass@1 70.2 | 383 |
| Mathematical Reasoning | MATH | Accuracy 36.1 | 338 |
| Science Reasoning | GPQA | Accuracy 27.3 | 243 |
| Code Generation | MBPP+ | Pass@1 71.7 | 216 |
| Commonsense Reasoning | HellaSwag | Accuracy 85.1 | 213 |
| Code Generation | MBPP | Pass@1 80.3 | 159 |
| Code Generation | EvalPlus | Pass@1 69.3 | 61 |

Showing 10 of 12 rows.
