HeatKV: Head-tuned KV-cache Compression for Visual Autoregressive Modeling

About

Visual Autoregressive (VAR) models have recently demonstrated impressive image generation quality while maintaining low latency. However, they suffer from severe KV-cache memory constraints, often requiring gigabytes of memory per generated image. We introduce HeatKV, a novel compression method that adapts the cache allocation of each head based on its attention to previously generated scales. Using a small offline calibration set, the attention heads are ranked according to their attention scores over prior scales. Based on this ranking, we construct a static pruning schedule tailored to a given memory budget. Applied to the Infinity-2B model, HeatKV achieves a $2\times$ higher KV-cache compression ratio than existing methods, while maintaining similar or better image fidelity, prompt alignment, and human perception scores. Our method achieves a new state of the art (SOTA) for VAR model KV-cache compression, showcasing the effectiveness of fine-grained, head-specific cache allocation. Code and a calibration script are available at https://github.com/arm-research/heatkv.
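The calibrate, rank, and schedule pipeline described above can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the input format (one dictionary per calibration sample mapping a head to its attention mass on prior scales), and the proportional allocation rule are all assumptions made for illustration. The abstract only states that heads are ranked by their attention to prior scales and that a static schedule is built for a given memory budget.

```python
from typing import Dict, List, Tuple

Head = Tuple[int, int]  # hypothetical identifier: (layer index, head index)


def mean_prior_scale_attention(samples: List[Dict[Head, float]]) -> Dict[Head, float]:
    """Average each head's attention mass on prior scales over the calibration samples."""
    heads = samples[0].keys()
    return {h: sum(s[h] for s in samples) / len(samples) for h in heads}


def rank_heads(scores: Dict[Head, float]) -> List[Head]:
    """Order heads from most to least attentive to prior scales."""
    return sorted(scores, key=lambda h: scores[h], reverse=True)


def build_schedule(
    scores: Dict[Head, float],
    full_cache_tokens: int,
    budget_fraction: float,
) -> Dict[Head, int]:
    """Assign each head a fixed number of cached tokens.

    Assumed rule (not from the paper): split the total token budget in
    proportion to each head's score, capped at the full cache size.
    Flooring keeps the sum within the budget.
    """
    total_budget = int(budget_fraction * full_cache_tokens * len(scores))
    score_sum = sum(scores.values())
    if score_sum == 0:
        # No head attends to prior scales: fall back to a uniform split.
        return {h: total_budget // len(scores) for h in scores}
    return {
        h: min(full_cache_tokens, int(total_budget * s / score_sum))
        for h, s in scores.items()
    }


# Toy calibration data for two heads in one layer (made-up numbers).
calibration = [
    {(0, 0): 0.5, (0, 1): 0.25},
    {(0, 0): 1.0, (0, 1): 0.25},
]
scores = mean_prior_scale_attention(calibration)
ranking = rank_heads(scores)
schedule = build_schedule(scores, full_cache_tokens=1000, budget_fraction=0.5)
```

Because the schedule is computed once offline, it adds no ranking cost at generation time; each head simply keeps its assigned number of cached tokens. Which tokens a head keeps within its budget is a separate choice that this sketch does not address.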

Jonathan Cederlund, Axel Berg, William Isaksson, Durmus Alp Emre Acar, Chuteng Zhou, Pontus Giselsson • 2026

Related benchmarks

Task	Dataset	Metric	Result
Text-to-Image Generation	GenEval	Overall Score (GenEval)	0.824	153
Text-to-Image Generation	HPS v2.1	Overall Score	30.86	153
Image Generation	MS COCO 2017	PSNR	28.43	42
