WeDLM: Reconciling Diffusion Language Models with Standard Causal Attention for Fast Inference

About

Autoregressive (AR) generation is the standard decoding paradigm for Large Language Models (LLMs), but its token-by-token nature limits parallelism at inference time. Diffusion Language Models (DLLMs) offer parallel decoding by recovering multiple masked tokens per step; however, in practice they often fail to translate this parallelism into deployment speed gains over optimized AR engines (e.g., vLLM). A key reason is that many DLLMs rely on bidirectional attention, which breaks standard prefix KV caching and forces repeated contextualization, undermining efficiency. We propose WeDLM, a diffusion decoding framework built entirely on standard causal attention to make parallel generation prefix-cache friendly. The core idea is to let each masked position condition on all currently observed tokens while keeping a strict causal mask, achieved by Topological Reordering that moves observed tokens to the physical prefix while preserving their logical positions. Building on this property, we introduce a streaming decoding procedure that continuously commits confident tokens into a growing left-to-right prefix and maintains a fixed parallel workload, avoiding the stop-and-wait behavior common in block diffusion methods. Experiments show that WeDLM preserves the quality of strong AR backbones while delivering substantial speedups, approaching 3x on challenging reasoning benchmarks and up to 10x in low-entropy generation regimes; critically, our comparisons are against AR baselines served by vLLM under matched deployment settings, demonstrating that diffusion-style decoding can outperform an optimized AR engine in practice.

Aiwei Liu, Minghua He, Shaoxun Zeng, Sijun Zhang, Linhao Zhang, Chuhan Wu, Wei Jia, Yuan Liu, Xiao Zhou, Jie Zhou• 2025

Related benchmarks

Task	Dataset	Result
Code Generation	HumanEval (test)	--	612
Code Generation	MBPP (test)	--	405
Mathematical Reasoning	GSM8K	Accuracy (Acc)90.2	337
General Knowledge	MMLU	MMLU General Knowledge Accuracy75.5	307
Code Generation	HumanEval+ (test)	Pass@168.9	132
Function-level Code Generation	HumanEval+ augmented (test)	Pass@173.8	65
Zero-shot Reasoning	ARC-Easy zero-shot	Zero-shot Accuracy97.43	41
Math & Science	MATH 4-shot	Score64.8	35
CUDA Kernel Generation	KernelBench Level 1	Exec Count14	31
CUDA Kernel Generation	KernelBench Level 2	Execution Count1	31

Showing 10 of 27 rows

Other info

Follow for update

@wizwand_team Discord