
d$^2$Cache: Accelerating Diffusion-Based LLMs via Dual Adaptive Caching

About

Diffusion-based large language models (dLLMs), despite their promising performance, still suffer from inferior inference efficiency. This is because dLLMs rely on bidirectional attention and cannot directly benefit from the standard key-value (KV) cache as autoregressive models (ARMs) do. To tackle this issue, we introduce Dual aDaptive Cache (d$^2$Cache), a training-free approximate KV cache framework for accelerating dLLM inference. d$^2$Cache features a two-stage fine-grained selection strategy to identify tokens and adaptively update their KV states at each decoding step, while caching the KV states of the remaining tokens for reuse. Furthermore, d$^2$Cache naturally offers a more reliable decoding alternative, which enables quasi left-to-right generation and mitigates premature overconfidence in tokens at the end of the sequence. Extensive experimental results on two representative dLLMs (i.e., LLaDA and Dream) demonstrate that d$^2$Cache not only achieves substantial inference speedups, but also yields consistent improvements in generation quality. The code is available at https://github.com/Kamichanw/d2Cache.
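To make the caching idea concrete, here is a minimal, hedged sketch of one decoding step with an approximate KV cache: the least-confident tokens get their K/V states recomputed, while all other tokens reuse cached states. This is an illustrative simplification, not the paper's actual two-stage selection; the function names (`compute_kv`, `adaptive_cache_step`) and the fraction-based selection rule are assumptions for the example.

```python
import numpy as np

def compute_kv(hidden):
    # Stand-in for the attention K/V projections (hypothetical placeholder).
    return hidden * 2.0, hidden * 3.0

def adaptive_cache_step(hidden, cache, confidence, recompute_frac=0.25):
    """One decoding step with an approximate KV cache.

    Tokens whose predictions are least confident have their K/V states
    recomputed from the current hidden states; every other token reuses
    the cached states from a previous step.
    """
    n = hidden.shape[0]
    k_states, v_states = cache
    # Select the least-confident tokens for a fresh K/V update.
    n_update = max(1, int(n * recompute_frac))
    update_idx = np.argsort(confidence)[:n_update]
    k_new, v_new = compute_kv(hidden[update_idx])
    k_states[update_idx] = k_new
    v_states[update_idx] = v_new
    return (k_states, v_states), update_idx

# Usage: 8 tokens with 4-dim hidden states; all K/V start cached at zero.
hidden = np.ones((8, 4))
cache = (np.zeros((8, 4)), np.zeros((8, 4)))
confidence = np.array([0.9, 0.2, 0.8, 0.1, 0.95, 0.5, 0.7, 0.3])
cache, updated = adaptive_cache_step(hidden, cache, confidence)
print(sorted(updated.tolist()))  # the two least-confident token positions
```

The key design point this illustrates is that per-step cost scales with the number of tokens selected for update rather than the full sequence length, which is where the speedup over recomputing all bidirectional KV states comes from.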

Yuchu Jiang, Yue Cai, Xiangzhong Luo, Jiale Fu, Jiarui Wang, Chonghan Liu, Xu Yang • 2025

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Code Generation | HumanEval | TPS | 14.06 | 41 |
| Code Generation | MBPP | Score | 58 | 38 |
| Code Generation | HumanEval 0-shot (test) | Pass@1 | 57.3 | 17 |
| Mathematical Reasoning | MATH 500 | Throughput | 13.86 | 16 |
| Multi-task Language Understanding | MMLU-Pro | Throughput | 10.12 | 16 |
| Mathematical Reasoning | GSM8K 4-shot (test) | Throughput | 46.69 | 15 |
| Code Generation | MBPP 3-shot (test) | -- | -- | 15 |
| Code Generation | MBPP | Throughput | 12.67 | 8 |
| Code Generation | HumanEval | Throughput | 14.36 | 8 |
| Mathematical Reasoning | GSM8K | Throughput | 12.37 | 8 |

Showing 10 of 21 rows.
