dKV-Cache: The Cache for Diffusion Language Models
About
Diffusion Language Models (DLMs) have been seen as a promising competitor to autoregressive language models (ARs). However, DLMs have long been constrained by slow inference. A core challenge is that their non-autoregressive architecture and bidirectional attention preclude the key-value cache that accelerates decoding. We address this bottleneck by proposing a KV-cache-like mechanism, delayed KV-Cache (dKV-Cache), for the denoising process of DLMs. Our approach is motivated by the observation that different tokens have distinct representation dynamics throughout the diffusion process. Accordingly, we propose a delayed and conditioned caching strategy for key and value states. We design two complementary variants that cache key and value states step by step: (1) dKV-Cache-Decode, which provides almost lossless acceleration and even improves performance on long sequences, suggesting that existing DLMs may under-utilise contextual information during inference; and (2) dKV-Cache-Greedy, which caches aggressively with a reduced cache lifespan, achieving higher speed-ups with quadratic time complexity at the cost of some performance degradation. Overall, dKV-Cache achieves a 2-10x inference speedup, largely narrowing the gap between ARs and DLMs. We evaluate dKV-Cache on several benchmarks, delivering acceleration across general language understanding, mathematical, and code-generation tasks. Experiments demonstrate that caching can also be used in DLMs, even in a training-free manner on existing DLMs.
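The delayed caching idea described above can be illustrated with a small toy simulation. This is only a sketch under stated assumptions, not the authors' implementation: the names `DelayedKVCache` and `simulate`, the `delay` parameter, and the rule "cache a position's key/value once it has been decoded for at least `delay` steps" are hypothetical choices made for illustration. The abstract states only that caching is delayed and conditioned. No real attention is computed; the code counts how many positions would still need their key/value states recomputed at each denoising step.

```python
from dataclasses import dataclass, field


@dataclass
class DelayedKVCache:
    """Toy cache (hypothetical): K/V for a position is stored only once the
    position has been decoded for at least `delay` denoising steps."""
    delay: int = 1
    decoded_at: dict = field(default_factory=dict)  # position -> step it was decoded
    store: dict = field(default_factory=dict)       # position -> (key, value)

    def mark_decoded(self, position, step):
        # Record the first step at which this position was unmasked.
        self.decoded_at.setdefault(position, step)

    def update(self, position, step, key, value):
        # Commit K/V only for positions decoded at least `delay` steps ago.
        decoded_step = self.decoded_at.get(position)
        if decoded_step is None or position in self.store:
            return False
        if step - decoded_step >= self.delay:
            self.store[position] = (key, value)
            return True
        return False


def simulate(decode_schedule, num_steps, delay=1):
    """decode_schedule maps step -> positions decoded at that step
    (positions are assumed to be 0..seq_len-1).
    Returns how many positions are recomputed at each step."""
    seq_len = sum(len(positions) for positions in decode_schedule.values())
    cache = DelayedKVCache(delay=delay)
    recomputed = []
    for step in range(num_steps):
        # Every position without cached K/V must be recomputed this step.
        todo = [p for p in range(seq_len) if p not in cache.store]
        recomputed.append(len(todo))
        for p in decode_schedule.get(step, []):
            cache.mark_decoded(p, step)
        for p in todo:
            # Placeholder tuples stand in for real key/value tensors.
            cache.update(p, step, key=("k", p, step), value=("v", p, step))
    return recomputed


schedule = {0: [0], 1: [1], 2: [2], 3: [3]}  # one token decoded per step
delayed_counts = simulate(schedule, num_steps=4, delay=1)
immediate_counts = simulate(schedule, num_steps=4, delay=0)
```

In this toy schedule, `delayed_counts` is `[4, 4, 3, 2]` and `immediate_counts` is `[4, 3, 2, 1]`: delaying the cache costs one extra recomputation per token, in exchange for caching a state computed after the token has been decoded rather than at the moment it is decoded.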
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Code Generation | HumanEval (test) | Pass@1 | 15.37 | 444 |
| Code Generation | MBPP (test) | -- | -- | 276 |
| Mathematical Reasoning | GSM8K | Speed Up (x) | 2.8 | 177 |
| Code Generation | MBPP | Pass@1 | 20.4 | 175 |
| Code Generation | HumanEval | Tokens/s | 18.9 | 61 |
| Code Generation | HumanEval | Accuracy | 59.8 | 51 |
| Mathematical Reasoning | MATH | Accuracy | 36.3 | 48 |
| Mathematical Reasoning | GSM8K (test) | Accuracy | 81.3 | 30 |
| Mathematical Reasoning | GSM8K 5-shot (test) | Strict Match Accuracy | 77.6 | 30 |
| Code Generation | MBPP | Accuracy | 53.2 | 25 |