
DiRL: An Efficient Post-Training Framework for Diffusion Language Models

About

Diffusion Language Models (dLLMs) have emerged as promising alternatives to Auto-Regressive (AR) models. While recent efforts have validated their pre-training potential and accelerated their inference, the post-training landscape for dLLMs remains underdeveloped. Existing methods suffer from computational inefficiency and objective mismatches between training and inference, severely limiting performance on complex reasoning tasks such as mathematics. To address this, we introduce DiRL, an efficient post-training framework that tightly integrates FlexAttention-accelerated blockwise training with LMDeploy-optimized inference. This architecture enables a streamlined online model-update loop, facilitating efficient two-stage post-training (Supervised Fine-Tuning followed by Reinforcement Learning). Building on this framework, we propose DiPO, the first unbiased Group Relative Policy Optimization (GRPO) implementation tailored for dLLMs. We validate our approach by training DiRL-8B-Instruct on high-quality math data. Our model achieves state-of-the-art math performance among dLLMs and surpasses comparable models in the Qwen2.5 series on several benchmarks.
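The GRPO objective referenced above replaces a learned value baseline with a group-relative one: for each prompt, several completions are sampled, and each completion's reward is normalized against the mean and standard deviation of its group. A minimal sketch of that advantage computation (function name and details are illustrative, not the authors' DiPO code):

```python
import math

def group_relative_advantages(rewards):
    """GRPO-style advantages: normalize each sampled completion's reward
    by the mean and standard deviation of its sampling group."""
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    std = math.sqrt(var)
    eps = 1e-8  # guard against a zero std when all rewards are equal
    return [(r - mean) / (std + eps) for r in rewards]

# Example: 4 completions sampled for one math prompt,
# reward 1.0 if the final answer is correct, else 0.0.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

By construction the advantages in each group sum to (approximately) zero, so correct completions are pushed up exactly as much as incorrect ones are pushed down, with no separate critic network.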

Ying Zhu, Jiaxin Wan, Xiaoran Liu, Siyang He, Qiqi Wang, Xu Guo, Tianyi Liang, Zengfeng Huang, Ziwei He, Xipeng Qiu • 2025

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Mathematical Reasoning | MATH500 | Accuracy (ACC) | 78.2 | 133 |
| Code Reasoning | LiveCodeBench | Accuracy | 10.4 | 46 |
| Mathematical Reasoning | AMC23 | AVG@8 | 65.6 | 25 |
| General Reasoning | GPQA | TPF | 227 | 14 |
| Reasoning Performance (Aggregate) | AVG | TPF | 219 | 14 |
| Mathematical Reasoning | AIME24 | TPF | 1.96 | 14 |
| Mathematical Reasoning | AIME25 | TPF | 192 | 14 |

Other info

GitHub
