ThinKV: Thought-Adaptive KV Cache Compression for Efficient Reasoning Models
About
Long-output generation in large reasoning models enables extended chain of thought (CoT) but also drives rapid growth of the key-value (KV) cache, quickly overwhelming GPU memory. To address this challenge, we propose ThinKV, a thought-adaptive KV cache compression framework. ThinKV is based on the observation that attention sparsity reveals distinct thought types of varying importance within the CoT. It applies a hybrid quantization-eviction strategy, assigning token precision according to thought importance and progressively evicting tokens from less critical thoughts as reasoning trajectories evolve. To implement ThinKV, we further design a kernel that extends PagedAttention so that the memory slots of evicted tokens can be reused efficiently, eliminating compaction overheads. Extensive experiments on DeepSeek-R1-Distill, GPT-OSS, and NVIDIA AceReason across mathematics and coding benchmarks show that ThinKV achieves near-lossless accuracy with less than 5% of the original KV cache, while delivering up to 5.8x higher inference throughput than state-of-the-art baselines.
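The two mechanisms described above, precision assigned by thought type and in-place reuse of evicted slots, can be illustrated with a small toy model. This is a minimal sketch and not the ThinKV kernel: the thought labels, the bit widths in `BITS_BY_THOUGHT`, the `ToyPagedBlock` class, and the "keep the most recent tokens" eviction rule are all hypothetical stand-ins chosen for illustration.

```python
# Hypothetical thought labels and bit widths; the abstract does not specify them.
BITS_BY_THOUGHT = {"reasoning": 8, "execution": 4, "transition": 2}


class ToyPagedBlock:
    """Toy fixed-size KV block whose evicted slots are reused without compaction."""

    def __init__(self, capacity):
        # Each slot holds (token_position, thought_label, bits) or None if empty.
        self.slots = [None] * capacity
        # Free list of slot indices; pop() hands out 0, 1, 2, ... initially.
        self.free = list(range(capacity - 1, -1, -1))

    def insert(self, position, thought):
        """Store a token at the precision assigned to its thought type."""
        if not self.free:
            raise MemoryError("block is full")
        slot = self.free.pop()
        self.slots[slot] = (position, thought, BITS_BY_THOUGHT[thought])
        return slot

    def evict_thought(self, thought, keep):
        """Evict all but the `keep` most recent tokens of one thought type.

        Recency is only a stand-in policy for this sketch. Evicted slots go
        back on the free list; surviving entries are never moved.
        """
        owned = [i for i, e in enumerate(self.slots)
                 if e is not None and e[1] == thought]
        owned.sort(key=lambda i: self.slots[i][0])  # oldest position first
        victims = owned[:max(0, len(owned) - keep)]
        for i in victims:
            self.slots[i] = None
            self.free.append(i)
        return victims


block = ToyPagedBlock(capacity=4)
labels = ["reasoning", "transition", "transition", "execution"]
for position, thought in enumerate(labels):
    block.insert(position, thought)

evicted = block.evict_thought("transition", keep=1)  # frees the older transition token
reused_slot = block.insert(4, "reasoning")           # lands in the freed slot
```

Because a new token is written directly into the freed slot, the surviving entries keep their indices, which is the property that lets a paged layout avoid a compaction pass after eviction.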
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Mathematical Reasoning | GSM8K | Accuracy (Acc) | 60.1 | 337 |
| Mathematical Reasoning | AIME | Accuracy | 33.2 | 18 |
| Mathematical Reasoning | MATH 500 | Accuracy | 76 | 18 |
| Long-form text generation | LongWriter | Accuracy | 67.9 | 18 |
| Mathematical Reasoning | AIME | TPR (s) | 237.5 | 10 |
| Reasoning | AIME | Pass@1 Accuracy | 70.28 | 8 |
| Reasoning | LiveCodeBench | Pass@1 Accuracy | 50.47 | 8 |
| Text Generation Throughput | R1-Llama-8B 32K generation | Memory Footprint (%) | 2.51 | 7 |
| Throughput Evaluation | vLLM | Throughput | 6.62e+3 | 5 |