
NestedKV: Nested Memory Routing for Long-Context KV Cache Compression

About

Long-context language models are limited by the memory footprint of the key-value (KV) cache. Existing training-free KV compression methods usually rank tokens by one importance signal (attention, recency, layer-wise allocation, or key distinctiveness), which becomes brittle when useful context is globally distinctive, locally episodic, or immediately relevant. We introduce NestedKV, a key-only KV cache compression method inspired by the Continuum Memory System in Nested Learning. NestedKV maintains global, block-level, and sliding-window key anchors, scores tokens by multi-time-scale cosine anomaly, and combines the resulting rankings with a training-free outer learner using head-adaptive mixing and surprise-gated token routing. The score is paired with adaptive per-head budgets and requires no training or LLM modification. Across RULER (4k to 32k), LooGLE, LongBench, LongBench-E, InfiniteBench, and MMLU-Pro on Qwen3 and Llama-3.2 models, NestedKV is strongest when the retained cache is small. On Qwen3-4B, it improves over KeyDiff by up to 19.10 points on RULER and 19.29 on LongBench at $r=0.75$; at $r=0.95$, it retains 37.32 on LongBench versus 17.55 for KeyDiff.
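To make the scoring idea in the abstract concrete, here is a minimal sketch of multi-time-scale cosine anomaly scoring for a single attention head. It is not the authors' implementation, and several details are assumptions because the abstract does not specify them: each anchor is taken to be a mean key vector, the anomaly is taken to be `1 - cosine(key, anchor)`, the three rankings are combined with fixed `weights` (a stand-in for the head-adaptive mixing and surprise-gated routing), and `r` is read as the fraction of tokens evicted. The function names and the `block_size` and `window` parameters are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0


def mean_vec(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def anomaly_scores(keys, block_size=4, window=4):
    """Return three per-token score lists: global, block, and window anomaly.

    ASSUMPTION: anchors are mean keys and anomaly is 1 - cosine similarity.
    """
    global_anchor = mean_vec(keys)
    g, b, w = [], [], []
    for i, k in enumerate(keys):
        start = (i // block_size) * block_size
        block_anchor = mean_vec(keys[start:start + block_size])
        lo = max(0, i - window + 1)
        window_anchor = mean_vec(keys[lo:i + 1])  # trailing window, includes token i
        g.append(1.0 - cosine(k, global_anchor))
        b.append(1.0 - cosine(k, block_anchor))
        w.append(1.0 - cosine(k, window_anchor))
    return g, b, w


def ranks(scores):
    """Rank positions: 0 for the lowest score, len(scores) - 1 for the highest."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    result = [0] * len(scores)
    for pos, i in enumerate(order):
        result[i] = pos
    return result


def select_tokens(keys, r=0.75, weights=(1 / 3, 1 / 3, 1 / 3),
                  block_size=4, window=4):
    """Return sorted indices of tokens to keep.

    ASSUMPTION: r is the evicted fraction, so the budget is (1 - r) * n,
    and fixed weights replace the paper's adaptive mixing.
    """
    n = len(keys)
    budget = max(1, round((1 - r) * n))
    rank_lists = [ranks(s) for s in anomaly_scores(keys, block_size, window)]
    combined = [sum(wt * rl[i] for wt, rl in zip(weights, rank_lists))
                for i in range(n)]
    keep = sorted(range(n), key=lambda i: combined[i], reverse=True)[:budget]
    return sorted(keep)


# Toy input: seven identical keys and one distinctive key at index 5.
keys = [[1.0, 0.0]] * 5 + [[0.0, 1.0]] + [[1.0, 0.0]] * 2
kept = select_tokens(keys, r=0.75)
```

On this toy input the distinctive key at index 5 ranks highest at all three time scales, so it is among the two tokens retained. Combining ranks instead of raw scores keeps the three signals on a comparable scale, which is one plausible reason to mix rankings; the paper's actual mixing rule may differ.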

Hong Chen, Xiang Liu, Yubo Gao, Yuxuan Fan, Bo Wang, Yuanlin Chu, Yuanguo Lin, Xuming Hu • 2026

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Long-context Language Understanding | LongBench-e | Average Score | 48.69 | 93 |
| Long-context retrieval and aggregation | RULER 4k | Average Accuracy | 94.73 | 76 |
| Long-context retrieval and aggregation | RULER 8k | Average Accuracy | 94.24 | 76 |
| Long-context retrieval and aggregation | RULER 16k | Average Accuracy | 91.46 | 76 |
| Long-context retrieval and aggregation | RULER 32k | Average Accuracy | 88.75 | 76 |
| Code Debugging | InfiniteBench code_debug 40k input cap | Accuracy | 34.01 | 19 |
| Question Answering | ∞-Bench Longbook QA English (test) | F1 Score | 11.2 | 18 |