
EntropyCache: Decoded Token Entropy Guided KV Caching for Diffusion Language Models

About

Diffusion-based large language models (dLLMs) rely on bidirectional attention, which prevents lossless KV caching and requires a full forward pass at every denoising step. Existing approximate KV caching methods reduce this cost by selectively updating cached states, but their decision overhead scales with context length or model depth. We propose EntropyCache, a training-free KV caching method that uses the maximum entropy of newly decoded token distributions as a constant-cost signal for deciding when to recompute. Our design is grounded in two empirical observations: (1) decoded token entropy correlates with KV cache drift, providing a cheap proxy for cache staleness, and (2) feature volatility of decoded tokens persists for multiple steps after unmasking, motivating recomputation of the $k$ most recently decoded tokens. The skip-or-recompute decision requires only $O(V)$ computation per step, independent of context length and model scale. Experiments on LLaDA-8B-Instruct and Dream-7B-Instruct show that EntropyCache achieves $15.2\times$-$26.4\times$ speedup on standard benchmarks and $22.4\times$-$24.1\times$ on chain-of-thought benchmarks, with competitive accuracy and decision overhead accounting for only $0.5\%$ of inference time. Code is available at https://github.com/mscheong01/EntropyCache.
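The skip-or-recompute decision described above can be illustrated with a minimal sketch. This is not the authors' implementation (see the linked repository for that). The function names, the `threshold` parameter, the "recompute when the maximum entropy exceeds the threshold" direction, and the `RecentlyDecoded` helper are all assumptions made for illustration. The sketch only shows why the decision depends on the vocabulary size $V$ rather than on context length: it looks only at the distributions of the tokens decoded in the current step.

```python
import math
from collections import deque


def token_entropy(logits):
    """Shannon entropy (in nats) of softmax(logits), computed stably."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    probs = [e / z for e in exps]
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def should_recompute(new_token_logits, threshold):
    """Hypothetical decision rule: recompute the cache when the maximum
    entropy over the newly decoded tokens exceeds `threshold`.

    `new_token_logits` holds one logit vector (length V) per token
    decoded in this step, so the work does not grow with context length.
    """
    if not new_token_logits:
        return False
    max_entropy = max(token_entropy(logits) for logits in new_token_logits)
    return max_entropy > threshold


class RecentlyDecoded:
    """Tracks the positions of the k most recently decoded tokens
    (assumed here to be the candidates for recomputation)."""

    def __init__(self, k):
        self._positions = deque(maxlen=k)

    def add(self, position):
        self._positions.append(position)

    def positions(self):
        return list(self._positions)


if __name__ == "__main__":
    confident = [10.0, 0.0, 0.0, 0.0]   # peaked distribution, low entropy
    uncertain = [0.0, 0.0, 0.0, 0.0]    # uniform distribution, high entropy
    print(should_recompute([confident], threshold=1.0))
    print(should_recompute([confident, uncertain], threshold=1.0))
```

In this sketch a single uncertain token is enough to trigger recomputation, because the signal is the maximum rather than the mean entropy. The threshold value of 1.0 is arbitrary and would need tuning per model.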

Minsoo Cheong, Donghyun Son, Woosang Lim, Sungjoo Yoo • 2026

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Reasoning | BBH 3-shot | BBH 3-shot Score: 55.86 | 49 |
| Mathematical Reasoning | GSM8K 8-shot | Accuracy: 81.2 | 26 |
| Science Question Answering | GPQA 0-shot (test) | Throughput: 38.73 | 14 |
| Code Generation | HumanEval zero-shot max_gen_len=512 | Accuracy: 58.54 | 12 |
| Mathematical Reasoning | GSM8K 4-shot max_gen_len=256 | Accuracy (%): 78.77 | 12 |
| Code Generation | MBPP 3-shot max_gen_len=256 | Accuracy: 48.8 | 12 |
| Multi-task Language Understanding | MMLU-Pro | Accuracy: 45.49 | 12 |
| Mathematical Reasoning | MATH500 4-shot max_gen_len=512 | Accuracy: 43.2 | 12 |
