Which Heads Matter for Reasoning? RL-Guided KV Cache Compression

About

Reasoning large language models exhibit complex reasoning behaviors via extended chain-of-thought generation that are highly fragile to information loss during decoding, creating critical challenges for KV cache compression. Existing token-dropping methods directly disrupt reasoning chains by removing intermediate steps, while head-reallocation methods, designed for retrieval tasks, fail to preserve the heads essential for generative reasoning. However, no existing method can identify which attention heads genuinely maintain reasoning consistency and control generation termination. To address this, we propose RLKV, which uses reinforcement learning as a probe to discover which heads contribute to reasoning quality by directly optimizing their cache usage against actual generation outcomes. This discovery naturally leads to an efficient compression strategy: we allocate full KV cache to reasoning-critical heads while aggressively compressing others with constant-size KV cache. Experiments reveal that a fraction of heads proves essential for reasoning, enabling 20–60% cache reduction with near-lossless performance across diverse tasks and models, and up to 2.06× end-to-end speedup at 60% reduction.
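The allocation described in the abstract (full KV cache for reasoning-critical heads, a constant-size cache for the rest) can be illustrated with a small budget calculation. This is a minimal sketch and not the RLKV implementation: the per-head importance scores, the `keep_fraction` parameter, the fixed `budget` of tokens for compressed heads, and all function names are hypothetical and chosen only for illustration. How RLKV actually scores heads and sizes the compressed cache is not specified in the text above.

```python
from typing import List


def select_full_heads(scores: List[float], keep_fraction: float) -> List[bool]:
    """Mark the top `keep_fraction` of heads (by score) as full-cache heads.

    `scores` is a hypothetical per-head importance score; higher means the
    head is assumed to matter more for reasoning.
    """
    n_keep = round(len(scores) * keep_fraction)
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    keep = set(ranked[:n_keep])
    return [i in keep for i in range(len(scores))]


def cached_tokens(seq_len: int, is_full: List[bool], budget: int) -> int:
    """Total number of cached token entries summed over all heads.

    Full heads cache every token; the others cache at most `budget` tokens
    (an assumed constant size, independent of sequence length).
    """
    return sum(seq_len if full else min(seq_len, budget) for full in is_full)


def cache_reduction(seq_len: int, is_full: List[bool], budget: int) -> float:
    """Fraction of cache entries saved relative to caching everything."""
    full_size = seq_len * len(is_full)
    return 1.0 - cached_tokens(seq_len, is_full, budget) / full_size


if __name__ == "__main__":
    scores = [0.9, 0.1, 0.8, 0.2]  # made-up scores for four heads
    mask = select_full_heads(scores, keep_fraction=0.5)
    print(mask)
    print(cached_tokens(1000, mask, budget=100))
    print(round(cache_reduction(1000, mask, budget=100), 3))
```

Because the compressed heads hold a fixed number of entries, their share of the total shrinks as the sequence grows, so in this toy model the saving approaches the fraction of compressed heads for long generations.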

Wenjie Du, Li Jiang, Keda Tao, Xue Liu, Huan Wang • 2025

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Mathematical Reasoning | GSM8K | GSM8K Accuracy (%): 94.9 | 204 |
| Mathematical Reasoning | AIME24 Math | Performance (%): 50 | 60 |
| Code Generation | MBPP Code | Performance (%): 82.4 | 60 |
| Multiple-choice Question Answering | MMLU-Pro Chem. | Accuracy: 72.2 | 40 |
| Multiple-choice Question Answering | MMLU-Pro Law | Accuracy: 27.6 | 40 |
| Multiple-choice Question Answering | MMLU-Pro Phys. | Accuracy (%): 69.8 | 40 |
| Long-context Reasoning | LongReason 64K-input 70K context | Accuracy: 68.5 | 34 |
| Multiple-choice Question Answering | MMLU-Pro CS | Performance: 56.8 | 20 |
| Professional Knowledge Reasoning | MMLU-Pro | MMLU-Pro Chemistry Accuracy: 44.8 | 20 |
| Question Answering | MMLU-Pro Computer Science | Accuracy: 63.2 | 20 |
