ReFusion: A Diffusion Large Language Model with Parallel Autoregressive Decoding

About

Autoregressive models (ARMs) are hindered by slow sequential inference. While masked diffusion models (MDMs) offer a parallel alternative, they suffer from critical drawbacks: high computational overhead from precluding Key-Value (KV) caching, and incoherent generation arising from learning dependencies over an intractable space of token combinations. To address these limitations, we introduce ReFusion, a novel masked diffusion model that integrates sequence reorganization into the causal attention framework. By elevating parallel decoding from the token level to a higher slot level, ReFusion interleaves inter-slot diffusion-based selection with intra-slot autoregressive infilling, while reordering newly generated slots ahead of the remaining masks after each iteration. Consequently, this design simultaneously unlocks full KV cache reuse and reduces learning complexity from an intractable token combination space to a manageable slot-level permutation space. Extensive experiments on seven diverse benchmarks show that ReFusion not only overwhelmingly surpasses prior MDMs with a 34% performance gain and an over 18× speedup on average, but also bridges the performance gap to strong ARMs while maintaining a 2.33× average speedup.
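The decoding loop described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation. The function name `toy_refusion_decode`, the callbacks `score_slot` and `next_token`, the fixed `slot_size`, and the threshold-with-fallback selection rule are all hypothetical stand-ins, since the abstract does not specify the model, the slot size, or the selection criterion. The sketch only shows the control flow: select masked slots, fill each selected slot left to right, then append the filled slots to a growing prefix so that earlier entries never move.

```python
from typing import Callable, List, Tuple

# A generated slot: (original slot position, tokens in that slot).
Slot = Tuple[int, List[str]]


def toy_refusion_decode(
    num_slots: int,
    slot_size: int,
    score_slot: Callable[[List[Slot], int], float],
    next_token: Callable[[List[Slot], int, List[str]], str],
    threshold: float = 0.5,
) -> Tuple[List[List[str]], List[int]]:
    """Toy slot-level decoding loop (illustrative assumptions throughout).

    Returns the slots restored to their original order, plus the order
    in which slot positions were generated.
    """
    done: List[Slot] = []            # generated slots, in generation order
    masked = list(range(num_slots))  # original positions still masked

    while masked:
        # Inter-slot step: score every masked slot given the current prefix.
        scores = {pos: score_slot(done, pos) for pos in masked}
        chosen = [pos for pos in masked if scores[pos] >= threshold]
        if not chosen:
            # Assumed fallback: always decode at least one slot per iteration.
            chosen = [max(masked, key=lambda p: scores[p])]

        # Intra-slot step: fill each chosen slot token by token. Every chosen
        # slot sees only the prefix from earlier iterations, not its siblings.
        new_slots: List[Slot] = []
        for pos in chosen:
            tokens: List[str] = []
            for _ in range(slot_size):
                tokens.append(next_token(done, pos, tokens))
            new_slots.append((pos, tokens))

        # Reordering step: new slots are appended after the existing prefix,
        # ahead of the remaining masks, so the prefix only ever grows.
        done.extend(new_slots)
        masked = [pos for pos in masked if pos not in chosen]

    order = [pos for pos, _ in done]
    restored = [tokens for _, tokens in sorted(done, key=lambda s: s[0])]
    return restored, order
```

Because `done` is only ever extended, a real model with causal attention could in principle keep the cached keys and values for everything already in the prefix, which is the property the abstract refers to as full KV cache reuse. With a scorer that favours even-numbered slots, for example, four slots would be generated in the order 0, 2, 1, 3 and then restored to positions 0 through 3.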

Jia-Nan Li, Jian Guan, Wei Wu, Chongxuan Li • 2025

Related benchmarks

Task	Dataset	Metric	Result
Code Generation	HumanEval	Pass@1	78.66	1043
Mathematical Reasoning	GSM8K	--	--	499
Mathematical Problem Solving	MATH	Accuracy	54.22	229
Math	GSM8K	Accuracy	0.8491	216
Code Generation	MBPP	Pass@1	54.12	211
General Reasoning	MMLU-Pro	Accuracy	45.94	201
Code Generation	MBPP	Accuracy (%)	68.2	146
Reasoning	ARC-C	Accuracy	89.76	112
Question Answering	GPQA Diamond	Pass@1	33.43	49
Question Answering	ARC	Pass@1	87.98	30

