NestedKV: Nested Memory Routing for Long-Context KV Cache Compression

About

Long-context language models are limited by the memory footprint of the key-value (KV) cache. Existing training-free KV compression methods usually rank tokens by one importance signal (attention, recency, layer-wise allocation, or key distinctiveness), which becomes brittle when useful context is globally distinctive, locally episodic, or immediately relevant. We introduce NestedKV, a key-only KV cache compression method inspired by the Continuum Memory System in Nested Learning. NestedKV maintains global, block-level, and sliding-window key anchors, scores tokens by multi-time-scale cosine anomaly, and combines the resulting rankings with a training-free outer learner using head-adaptive mixing and surprise-gated token routing. The score is paired with adaptive per-head budgets and requires no training or LLM modification. Across RULER (4k to 32k), LooGLE, LongBench, LongBench-E, InfiniteBench, and MMLU-Pro on Qwen3 and Llama-3.2 models, NestedKV is strongest when the retained cache is small. On Qwen3-4B, it improves over KeyDiff by up to 19.10 points on RULER and 19.29 on LongBench at r = 0.75; at r = 0.95, it retains 37.32 on LongBench versus 17.55 for KeyDiff.
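
The scoring idea in the abstract can be illustrated with a minimal NumPy sketch for a single attention head. Everything specific here is an assumption made for illustration, not the authors' implementation: each anchor is taken to be a mean key, the sliding-window anchor is the mean of the most recent keys, `block_size` and `window` are hypothetical parameters, and r is read as the fraction of tokens evicted. The head-adaptive mixing and surprise-gated routing of the paper are replaced by a plain fixed-weight rank average, and the adaptive per-head budgets are omitted.

```python
import numpy as np

def cosine_anomaly(keys, anchor):
    """Return 1 - cos(key, anchor) per token; higher means more distinctive."""
    k = keys / (np.linalg.norm(keys, axis=-1, keepdims=True) + 1e-8)
    a = anchor / (np.linalg.norm(anchor, axis=-1, keepdims=True) + 1e-8)
    return 1.0 - np.sum(k * a, axis=-1)

def multi_scale_scores(keys, block_size=4, window=4):
    """Anomaly of every key against global, block, and window anchors."""
    n, d = keys.shape
    # Global anchor (assumed): mean of all keys.
    g = cosine_anomaly(keys, keys.mean(axis=0, keepdims=True))
    # Block anchors (assumed): mean key of each contiguous block.
    block_ids = np.arange(n) // block_size
    n_blocks = block_ids[-1] + 1
    sums = np.zeros((n_blocks, d))
    np.add.at(sums, block_ids, keys)
    counts = np.bincount(block_ids, minlength=n_blocks)[:, None]
    b = cosine_anomaly(keys, (sums / counts)[block_ids])
    # Sliding-window anchor (assumed): mean of the most recent keys.
    w = cosine_anomaly(keys, keys[-window:].mean(axis=0, keepdims=True))
    return np.stack([g, b, w])  # shape (3, n)

def to_ranks(scores):
    """Rank 0 is the least anomalous token, n - 1 the most anomalous."""
    return np.argsort(np.argsort(scores, axis=-1), axis=-1)

def select_tokens(keys, r, weights=(1 / 3, 1 / 3, 1 / 3), **kwargs):
    """Keep the (1 - r) fraction of tokens with the highest combined rank."""
    n = keys.shape[0]
    ranks = to_ranks(multi_scale_scores(keys, **kwargs))
    # Fixed-weight rank average: a stand-in for the paper's outer learner.
    combined = np.tensordot(np.asarray(weights), ranks, axes=1)
    keep = max(1, int(round((1 - r) * n)))
    kept = np.argsort(-combined, kind="stable")[:keep]
    return np.sort(kept)  # indices of retained tokens, in original order
```

Calling `select_tokens(keys, r=0.75)` on a `(16, 8)` key matrix returns four token indices. Combining ranks rather than raw scores keeps the three time scales comparable even when their anomaly values have different ranges, which is one plausible reason for a rank-based combination.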

Hong Chen, Xiang Liu, Yubo Gao, Yuxuan Fan, Bo Wang, Yuanlin Chu, Yuanguo Lin, Xuming Hu • 2026

Related benchmarks

Task	Dataset	Result
Long-context Language Understanding	LongBench-E	Average Score: 48.69	93
Long-context retrieval and aggregation	RULER 4k	Average Accuracy: 94.73	76
Long-context retrieval and aggregation	RULER 8k	Average Accuracy: 94.24	76
Long-context retrieval and aggregation	RULER 16k	Average Accuracy: 91.46	76
Long-context retrieval and aggregation	RULER 32k	Average Accuracy: 88.75	76
Code Debugging	InfiniteBench code_debug 40k input cap	Accuracy: 34.01	19
Question Answering	∞-Bench Longbook QA English (test)	F1 Score: 11.2	18
