Don't be so Stief! Learning KV Cache low-rank approximation over the Stiefel manifold

About

Key-value (KV) caching enables fast autoregressive decoding but at long contexts becomes a dominant bottleneck in High Bandwidth Memory (HBM) capacity and bandwidth. A common mitigation is to compress cached keys and values by projecting per-head matrices to a lower rank, storing only the projections in the HBM. However, existing post-training approaches typically fit these projections using SVD-style proxy objectives, which may poorly reflect end-to-end reconstruction after softmax, value mixing, and subsequent decoder-layer transformations. For these reasons, we introduce StiefAttention, a post-training KV-cache compression method that learns orthonormal projection bases by directly minimizing decoder-layer output reconstruction error. StiefAttention additionally constructs layer-wise error-rank profiles over candidate ranks, enabling sequential rank allocation under a user-specified KV cache budget. Notably, on Llama3-8B under the same conditions, StiefAttention outperforms EigenAttention by $4.2$ points on C4 perplexity and $8.9$ points on 0-shot MMLU accuracy at iso-compression, yielding lower relative error and higher cosine similarity with respect to the original decoder-layer outputs.

Luca Benfenati, Matteo Risso, Andrea Vannozzi, Ahmet Caner Y\"uz\"ug\"uler, Lukas Cavigelli, Enrico Macii, Daniele Jahier Pagliari, Alessio Burrello• 2026

Related benchmarks

Task	Dataset	Result
Language Modeling	C4	Perplexity24.58	1565
Language Modeling	WikiText	PPL9.96	740
Multiple-choice Question Answering	MMLU	Accuracy60	210
Multiple-choice Question Answering	HellaSwag	Accuracy58	196
Common-sense QA	PIQA	Accuracy79	10

Showing 5 of 5 rows

Other info

Follow for update

@wizwand_team Discord