Which Heads Matter for Reasoning? RL-Guided KV Cache Compression

About

Reasoning large language models exhibit complex reasoning behaviors through extended chain-of-thought generation, and these behaviors are highly fragile to information loss during decoding, which creates critical challenges for KV cache compression. Existing token-dropping methods directly disrupt reasoning chains by removing intermediate steps, while head-reallocation methods, designed for retrieval tasks, fail to preserve the heads essential for generative reasoning. Moreover, no existing method can identify which attention heads genuinely maintain reasoning consistency and control generation termination. To address this, we propose RLKV, which uses reinforcement learning as a probe to discover which heads contribute to reasoning quality by directly optimizing their cache usage against actual generation outcomes. This discovery naturally leads to an efficient compression strategy: we allocate full KV cache to reasoning-critical heads while aggressively compressing the others with a constant-size KV cache. Experiments reveal that only a fraction of heads is essential for reasoning, enabling 20-60% cache reduction with near-lossless performance across diverse tasks and models, and up to 2.06x end-to-end speedup at 60% reduction.
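The allocation strategy described above (full cache for reasoning-critical heads, a constant-size cache for the rest) can be illustrated with a small accounting sketch. This is not the RLKV implementation: the per-head scores are hypothetical inputs standing in for whatever the RL probe learns, and the names `select_full_heads`, `HeadCachePlan`, `keep_fraction`, and `window` are assumptions introduced only for this example. It counts cached tokens per head and nothing more.

```python
from dataclasses import dataclass


def select_full_heads(scores, keep_fraction):
    """Pick the top-scoring heads to keep a full KV cache.

    `scores` is a hypothetical list of per-head importance values
    (assumed here; the abstract does not specify their form).
    """
    k = round(keep_fraction * len(scores))
    ranked = sorted(range(len(scores)), key=lambda h: scores[h], reverse=True)
    return frozenset(ranked[:k])


@dataclass(frozen=True)
class HeadCachePlan:
    num_heads: int
    full_heads: frozenset  # heads that keep every token
    window: int            # assumed constant cache size for all other heads

    def cached_tokens(self, head, seq_len):
        # Full-cache heads grow with the sequence; others are capped.
        if head in self.full_heads:
            return seq_len
        return min(seq_len, self.window)

    def reduction(self, seq_len):
        # Fraction of KV entries saved relative to caching everything.
        total = sum(self.cached_tokens(h, seq_len) for h in range(self.num_heads))
        return 1.0 - total / (self.num_heads * seq_len)


# Illustrative numbers only: 8 heads, half kept at full cache.
scores = [0.9, 0.1, 0.8, 0.2, 0.7, 0.3, 0.6, 0.4]
plan = HeadCachePlan(
    num_heads=len(scores),
    full_heads=select_full_heads(scores, keep_fraction=0.5),
    window=128,
)
print(sorted(plan.full_heads))   # [0, 2, 4, 6]
print(plan.reduction(1024))      # 0.4375
```

Because the compressed heads hold a fixed number of tokens, the saved fraction approaches the share of compressed heads as the generated sequence grows, which is why long chain-of-thought outputs benefit most in this toy model.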

Wenjie Du, Li Jiang, Keda Tao, Xue Liu, Huan Wang • 2025

Related benchmarks

Task	Dataset	Result
Mathematical Reasoning	GSM8K	GSM8K Accuracy (%): 94.9	220
Mathematical Reasoning	AIME24 Math	Performance (%): 50	60
Code Generation	MBPP Code	Performance (%): 82.4	60
Multiple-choice Question Answering	MMLU-Pro Chem.	Accuracy: 72.2	40
Multiple-choice Question Answering	MMLU-Pro Law	Accuracy: 27.6	40
Multiple-choice Question Answering	MMLU-Pro Phys.	Accuracy (%): 69.8	40
Long-context Reasoning	LongReason 64K-input 70K context	Accuracy: 68.5	34
Multiple-choice Question Answering	MMLU-Pro CS	Performance: 56.8	20
Professional Knowledge Reasoning	MMLU-Pro	MMLU-Pro Chemistry Accuracy: 44.8	20
Question Answering	MMLU-Pro Computer Science	Accuracy: 63.2	20

