Which Heads Matter for Reasoning? RL-Guided KV Cache Compression
About
Reasoning large language models exhibit complex reasoning behaviors via extended chain-of-thought generation, and these behaviors are highly fragile to information loss during decoding, creating critical challenges for KV cache compression. Existing token-dropping methods directly disrupt reasoning chains by removing intermediate steps, while head-reallocation methods, designed for retrieval tasks, fail to preserve the heads essential for generative reasoning. No existing method can identify which attention heads genuinely maintain reasoning consistency and control generation termination. To address this, we propose RLKV, which uses reinforcement learning as a probe to discover which heads contribute to reasoning quality by directly optimizing their cache usage against actual generation outcomes. This discovery naturally leads to an efficient compression strategy: we allocate full KV cache to reasoning-critical heads while aggressively compressing the others with a constant-size KV cache. Experiments reveal that a fraction of heads proves essential for reasoning, enabling 20–60% cache reduction with near-lossless performance across diverse tasks and models, and up to 2.06x end-to-end speedup at 60% reduction.
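The allocation described above (full cache for reasoning-critical heads, a constant-size cache for the rest) can be illustrated with a rough memory-accounting sketch. This is not the RLKV implementation: the function names, the `const_budget` parameter, and the choice to count cached tokens per head are all assumptions made for illustration, and the abstract does not specify how the constant-size cache is built or how the reported reduction is measured.

```python
def mixed_cache_tokens(seq_len, num_heads, critical_heads, const_budget):
    """Count cached tokens in one layer under a mixed allocation.

    Hypothetical accounting, not taken from the paper:
    - heads in `critical_heads` keep all `seq_len` tokens;
    - every other head keeps at most `const_budget` tokens.
    """
    total = 0
    for head in range(num_heads):
        if head in critical_heads:
            total += seq_len                      # full KV cache
        else:
            total += min(seq_len, const_budget)   # constant-size KV cache
    return total


def cache_reduction(seq_len, num_heads, critical_heads, const_budget):
    """Fraction of cached tokens saved relative to a full cache on every head."""
    full = seq_len * num_heads
    mixed = mixed_cache_tokens(seq_len, num_heads, critical_heads, const_budget)
    return 1.0 - mixed / full


if __name__ == "__main__":
    # Illustrative numbers only: 8 heads, 4 treated as critical,
    # a 1000-token sequence, and a 100-token budget for the other heads.
    saved = cache_reduction(seq_len=1000, num_heads=8,
                            critical_heads={0, 1, 2, 3}, const_budget=100)
    print(f"cache reduction: {saved:.0%}")
```

Under this accounting the saving grows with sequence length, because the compressed heads stop growing once they reach the budget while the critical heads keep the whole sequence. That is one reason such a scheme is attractive for long chain-of-thought generation.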
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Mathematical Reasoning | GSM8K | GSM8K Accuracy (%) | 94.9 | 204 |
| Mathematical Reasoning | AIME24 Math | Performance (%) | 50 | 60 |
| Code Generation | MBPP Code | Performance (%) | 82.4 | 60 |
| Multiple-choice Question Answering | MMLU-Pro Chem. | Accuracy | 72.2 | 40 |
| Multiple-choice Question Answering | MMLU-Pro Law | Accuracy | 27.6 | 40 |
| Multiple-choice Question Answering | MMLU-Pro Phys. | Accuracy (%) | 69.8 | 40 |
| Long-context Reasoning | LongReason 64K-input 70K context | Accuracy | 68.5 | 34 |
| Multiple-choice Question Answering | MMLU-Pro CS | Performance | 56.8 | 20 |
| Professional Knowledge Reasoning | MMLU-Pro | MMLU-Pro Chemistry Accuracy | 44.8 | 20 |
| Question Answering | MMLU-Pro Computer Science | Accuracy | 63.2 | 20 |