
Don't be so Stief! Learning KV Cache low-rank approximation over the Stiefel manifold

About

Key-value (KV) caching enables fast autoregressive decoding, but at long contexts it becomes a dominant bottleneck in High Bandwidth Memory (HBM) capacity and bandwidth. A common mitigation is to compress cached keys and values by projecting per-head matrices to a lower rank, storing only the projections in HBM. However, existing post-training approaches typically fit these projections using SVD-style proxy objectives, which may poorly reflect end-to-end reconstruction after softmax, value mixing, and subsequent decoder-layer transformations. To address this, we introduce StiefAttention, a post-training KV-cache compression method that learns orthonormal projection bases by directly minimizing decoder-layer output reconstruction error. StiefAttention additionally precomputes, for each layer, an error-rank profile over candidate ranks, enabling flexible layer-wise rank allocation under a user-specified error budget. Notably, on Llama3-8B under the same conditions, StiefAttention outperforms EigenAttention by 11.9 points on C4 perplexity and 5.4% on 0-shot MMLU accuracy at iso-compression, yielding lower relative error and higher cosine similarity with respect to the original decoder-layer outputs.
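To make the compression mechanism concrete, here is a minimal NumPy sketch of the general idea the abstract describes: a per-head key matrix is projected onto an orthonormal basis (a point on the Stiefel manifold, i.e. a matrix with orthonormal columns), only the low-rank projection is stored, and keys are lifted back at attention time. This sketch uses an SVD-derived basis as a stand-in; it is not the paper's training procedure, which learns the basis by minimizing decoder-layer output error. All shapes and variable names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, head_dim, rank = 128, 64, 16  # illustrative sizes

# One attention head's cached key matrix (seq_len x head_dim).
K = rng.standard_normal((seq_len, head_dim))

# SVD-style proxy basis: top-r right singular vectors of K.
# Columns of P are orthonormal, so P lies on the Stiefel
# manifold St(head_dim, rank): P.T @ P = I_rank.
_, _, Vt = np.linalg.svd(K, full_matrices=False)
P = Vt[:rank].T  # (head_dim, rank)

# Only the low-rank projection is kept in HBM ...
K_compressed = K @ P              # (seq_len, rank)
# ... and keys are reconstructed on the fly at attention time.
K_reconstructed = K_compressed @ P.T  # (seq_len, head_dim)

# Relative reconstruction error of this head's keys.
rel_err = np.linalg.norm(K - K_reconstructed) / np.linalg.norm(K)
```

Sweeping `rank` and recording `rel_err` per layer would give an error-rank profile of the kind the abstract mentions, from which ranks can be allocated layer-wise under an error budget.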

Luca Benfenati, Matteo Risso, Andrea Vannozzi, Ahmet Caner Yüzügüler, Lukas Cavigelli, Enrico Macii, Daniele Jahier Pagliari, Alessio Burrello • 2026

Related benchmarks

Task                                Dataset     Metric      Result   Rank
Language Modeling                   C4          Perplexity  24.58    1182
Language Modeling                   WikiText    Perplexity  9.96     479
Multiple-choice Question Answering  MMLU        Accuracy    60       148
Multiple-choice Question Answering  HellaSwag   Accuracy    58       59
Common-sense QA                     PIQA        Accuracy    79       10
