
Don't be so Stief! Learning KV Cache low-rank approximation over the Stiefel manifold

About

Key-value (KV) caching enables fast autoregressive decoding but, at long contexts, becomes a dominant bottleneck in High Bandwidth Memory (HBM) capacity and bandwidth. A common mitigation is to compress cached keys and values by projecting per-head matrices to a lower rank and storing only the projections in HBM. However, existing post-training approaches typically fit these projections using SVD-style proxy objectives, which may poorly reflect end-to-end reconstruction after softmax, value mixing, and subsequent decoder-layer transformations. To address this, we introduce StiefAttention, a post-training KV-cache compression method that learns orthonormal projection bases by directly minimizing decoder-layer output reconstruction error. StiefAttention additionally constructs layer-wise error-rank profiles over candidate ranks, enabling sequential rank allocation under a user-specified KV cache budget. Notably, on Llama3-8B under the same conditions, StiefAttention outperforms EigenAttention by $4.2$ points on C4 perplexity and $8.9$ points on 0-shot MMLU accuracy at iso-compression, yielding lower relative error and higher cosine similarity with respect to the original decoder-layer outputs.
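
To make the idea of rank allocation from error-rank profiles more concrete, the following is a minimal sketch of one possible greedy scheme. It is not necessarily the procedure used by StiefAttention: the function name `allocate_ranks`, the profile values, and the choice to express the KV cache budget as a total rank summed over layers are all assumptions made for illustration.

```python
from typing import Dict, List

# Hypothetical error-rank profile for one layer: candidate rank -> output
# reconstruction error at that rank. All numbers below are invented.
Profile = Dict[int, float]


def allocate_ranks(profiles: List[Profile], budget: int) -> List[int]:
    """Greedily pick one rank per layer so that the summed rank stays within budget.

    Assumption: the budget is a total rank across layers, used here as a
    simple stand-in for KV cache size.
    """
    # Start every layer at its smallest candidate rank.
    ranks = [min(profile) for profile in profiles]
    if sum(ranks) > budget:
        raise ValueError("budget is below the minimum achievable total rank")

    while True:
        best_layer, best_rank, best_gain = None, None, 0.0
        for layer, profile in enumerate(profiles):
            current = ranks[layer]
            larger = [r for r in profile if r > current]
            if not larger:
                continue  # layer is already at its largest candidate rank
            nxt = min(larger)
            extra = nxt - current
            if sum(ranks) + extra > budget:
                continue  # this upgrade would exceed the budget
            # Error reduction per unit of additional rank.
            gain = (profile[current] - profile[nxt]) / extra
            if gain > best_gain:
                best_layer, best_rank, best_gain = layer, nxt, gain
        if best_layer is None:
            return ranks  # no affordable upgrade reduces the error
        ranks[best_layer] = best_rank


# Three hypothetical layers with candidate ranks 32, 64, and 128.
profiles = [
    {32: 0.40, 64: 0.20, 128: 0.04},
    {32: 0.10, 64: 0.08, 128: 0.07},
    {32: 0.60, 64: 0.25, 128: 0.15},
]
ranks = allocate_ranks(profiles, budget=256)
print(ranks)
```

In this toy setup, layers whose error drops sharply with rank receive more of the budget, while a layer that is already well approximated at a low rank stays small. A real implementation would measure the profiles on calibration data and would likely account for per-head dimensions and separate key and value ranks.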

Luca Benfenati, Matteo Risso, Andrea Vannozzi, Ahmet Caner Yüzügüler, Lukas Cavigelli, Enrico Macii, Daniele Jahier Pagliari, Alessio Burrello • 2026

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Language Modeling | C4 | Perplexity: 24.58 | 1565 |
| Language Modeling | WikiText | PPL: 9.96 | 740 |
| Multiple-choice Question Answering | MMLU | Accuracy: 60 | 210 |
| Multiple-choice Question Answering | HellaSwag | Accuracy: 58 | 196 |
| Common-sense QA | PIQA | Accuracy: 79 | 10 |