
FastKV: Decoupling of Context Reduction and KV Cache Compression for Prefill-Decoding Acceleration

About

While large language models (LLMs) excel at handling long-context sequences, they require substantial prefill computation and key-value (KV) cache, which can heavily burden computational efficiency and memory usage in both prefill and decoding stages. Recent works that compress KV caches with prefill acceleration reduce this cost but inadvertently tie the prefill compute reduction to the decoding KV budget. This coupling arises from overlooking the layer-dependent variation of critical context, often leading to accuracy degradation. To address this issue, we introduce FastKV, a KV cache compression framework designed to reduce latency in both prefill and decoding by leveraging the stabilization of token importance in later layers. FastKV performs full-context computation until a Token-Selective Propagation (TSP) layer, which forwards only the most informative tokens to subsequent layers. From these propagated tokens, FastKV independently selects salient KV entries for caching, thereby decoupling the KV budget from the prefill compute reduction based on the TSP decision. This independent control of the TSP rate and KV retention rate enables flexible optimization of efficiency and accuracy. Experimental results show that FastKV achieves speedups of up to 1.82× in prefill and 2.87× in decoding compared to the full-context baseline, while matching the accuracy of baselines that only accelerate the decoding stage. Our code is available at https://github.com/dongwonjo/FastKV.
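The decoupling described above can be illustrated with a minimal sketch. Everything here is hypothetical (the toy importance scores, the `tsp_rate`/`kv_rate` names, and the top-k selection rule are illustrative stand-ins, not FastKV's actual API): after the TSP layer, only a `tsp_rate` fraction of tokens is propagated, and the KV cache budget `kv_rate` is then chosen independently from within that propagated set.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len = 16

# Toy per-token importance scores; in the real method these would be
# derived from attention maps at the TSP layer (assumption for illustration).
importance = rng.random(seq_len)

def top_k_indices(scores, k):
    # Indices of the k highest-scoring tokens, in original order.
    return np.sort(np.argsort(scores)[::-1][:k])

# 1) TSP: forward only the most informative tokens to layers after the
#    TSP layer, shrinking prefill compute for the remaining layers.
tsp_rate = 0.5
k_tsp = max(1, int(np.ceil(tsp_rate * seq_len)))
propagated = top_k_indices(importance, k_tsp)

# 2) KV selection: independently keep a (typically smaller) subset of the
#    propagated tokens in the KV cache, sizing the decoding-time budget
#    relative to the full context.
kv_rate = 0.25
k_kv = max(1, int(np.ceil(kv_rate * seq_len)))
order = propagated[np.argsort(importance[propagated])[::-1]]
kv_kept = np.sort(order[:k_kv])

# The two budgets are separate knobs: changing kv_rate does not change
# how much prefill compute TSP saves, and vice versa.
assert set(kv_kept.tolist()) <= set(propagated.tolist())
print(len(propagated), len(kv_kept))  # → 8 4
```

With 16 tokens, `tsp_rate = 0.5` propagates 8 tokens and `kv_rate = 0.25` caches 4 of them; either rate can be tuned without touching the other, which is the flexibility the abstract refers to.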

Dongwon Jo, Jiwon Song, Yulhwa Kim, Jae-Joon Kim • 2025

Related benchmarks

Task | Dataset | Metric | Result | Rank
Long-context Language Understanding | LongBench | M-Avg | 48.04 | 292
Long-context Language Understanding | LongBench (test) | Average Score | 51.07 | 147
Long-context Language Understanding | LongBench | Average Score | 51.12 | 86
Long-context Understanding | RULER | Performance @ 4K Context | 99 | 65
Long-context Language Understanding | LongBench 1.0 (test) | MultiNews | 24.67 | 61
Long-context language modeling | RULER | -- | -- | 51
Long-context retrieval | RULER | Retrieval Accuracy (8K) | 77.8 | 34
Long-context Understanding | InfiniteBench v1 (test) | Dialogue | 17 | 31
Long-context language modeling | LongBench (test) | Qasper Score | 50.12 | 29
Long-context Understanding | LongBench LLaMA-3.1-8B-Instruct (test) | NrtvQA | 30.54 | 14

Showing 10 of 16 rows
