
d$^2$Cache: Accelerating Diffusion-Based LLMs via Dual Adaptive Caching

About

Diffusion-based large language models (dLLMs), despite their promising performance, still suffer from inferior inference efficiency. This is because dLLMs rely on bidirectional attention and cannot directly benefit from the standard key-value (KV) cache as autoregressive models (ARMs) do. To tackle this issue, we introduce Dual aDaptive Cache (d$^2$Cache), a training-free approximate KV cache framework for accelerating dLLM inference. d$^2$Cache features a two-stage fine-grained selection strategy to identify tokens and adaptively update their KV states at each decoding step, while caching the KV states of the remaining tokens for reuse. Furthermore, d$^2$Cache naturally offers a more reliable decoding alternative, which enables quasi left-to-right generation and mitigates premature overconfidence in tokens at the end of the sequence. Extensive experimental results on two representative dLLMs (i.e., LLaDA and Dream) demonstrate that d$^2$Cache not only achieves substantial inference speedups, but also yields consistent improvements in generation quality. The code is available at https://github.com/Kamichanw/d2Cache.
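To make the caching idea concrete, here is a minimal, hedged sketch of one decoding step with an approximate KV cache: the least-confident tokens get their K/V states recomputed, while all other tokens reuse cached states. This is an illustrative simplification, not the paper's actual two-stage selection; the function names (`compute_kv`, `adaptive_cache_step`) and the fraction-based selection rule are assumptions for the example.

```python
import numpy as np

def compute_kv(hidden):
    # Stand-in for the attention K/V projections (hypothetical placeholder).
    return hidden * 2.0, hidden * 3.0

def adaptive_cache_step(hidden, cache, confidence, recompute_frac=0.25):
    """One decoding step with an approximate KV cache.

    Tokens whose predictions are least confident have their K/V states
    recomputed from the current hidden states; every other token reuses
    the cached states from a previous step.
    """
    n = hidden.shape[0]
    k_states, v_states = cache
    # Select the least-confident tokens for a fresh K/V update.
    n_update = max(1, int(n * recompute_frac))
    update_idx = np.argsort(confidence)[:n_update]
    k_new, v_new = compute_kv(hidden[update_idx])
    k_states[update_idx] = k_new
    v_states[update_idx] = v_new
    return (k_states, v_states), update_idx

# Usage: 8 tokens with 4-dim hidden states; all K/V start cached at zero.
hidden = np.ones((8, 4))
cache = (np.zeros((8, 4)), np.zeros((8, 4)))
confidence = np.array([0.9, 0.2, 0.8, 0.1, 0.95, 0.5, 0.7, 0.3])
cache, updated = adaptive_cache_step(hidden, cache, confidence)
print(sorted(updated.tolist()))  # the two least-confident token positions
```

The key design point this illustrates is that per-step cost scales with the number of tokens selected for update rather than the full sequence length, which is where the speedup over recomputing all bidirectional KV states comes from.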

Yuchu Jiang, Yue Cai, Xiangzhong Luo, Jiale Fu, Jiarui Wang, Chonghan Liu, Xu Yang • 2025

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Code Generation | HumanEval | TPS | 14.06 | 41 |
| Code Generation | MBPP | Score | 58 | 38 |
| Code Generation | HumanEval 0-shot (test) | Pass@1 | 57.3 | 17 |
| Mathematical Reasoning | MATH 500 | Throughput | 13.86 | 16 |
| Multi-task Language Understanding | MMLU-Pro | Throughput | 10.12 | 16 |
| Mathematical Reasoning | GSM8K 4-shot (test) | Throughput | 46.69 | 15 |
| Code Generation | MBPP 3-shot (test) | -- | -- | 15 |
| Code Generation | MBPP | Throughput | 12.67 | 8 |
| Code Generation | HumanEval | Throughput | 14.36 | 8 |
| Mathematical Reasoning | GSM8K | Throughput | 12.37 | 8 |

Showing 10 of 21 rows.
