Sub-Token Routing for KV Cache Compression

About

Transformer inference often requires a large KV cache, especially for long-context language modeling and multimodal generation. Existing compression methods usually reduce cache cost by selecting, evicting, quantizing, or compressing cached tokens, or by reducing the visual-token sequence before language-model inference. We introduce sub-token routing, a KV-compression method that adds a finer control axis inside retained tokens. It splits each retained value vector into groups and keeps only selected groups, while leaving query and key states unchanged. The method is designed to work after token-level reduction. First, a token-reduction method determines which tokens are retained. Then, sub-token routing compresses the value states inside those retained tokens. Experiments under matched KV budgets show that adding sub-token routing improves token-level reduction performance in both LLM and VLM settings, including Quest on LLaMA-2-7B and Qwen2.5-7B, and FastV/VisionZip across LLaVA and Qwen-VL models. The gains are larger at smaller KV budgets, suggesting that value-group routing is especially useful when further token removal becomes costly. Overall, token-level reduction and sub-token routing provide complementary ways to reduce KV cost.
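The value-group step described above can be illustrated with a minimal NumPy sketch. The abstract does not say how groups are selected, so scoring each group by its L2 norm is an assumption made purely for illustration, as are the function names `route_value_groups`, `restore_values`, and `kv_fraction`. The budget helper further assumes that keys and values occupy equal cache space and that keys are stored in full.

```python
import numpy as np


def route_value_groups(values, num_groups, keep_groups):
    """Split each retained value vector into groups and keep a subset.

    values: array of shape (tokens, head_dim) for tokens that survived
    token-level reduction. Group selection by L2 norm is an assumption
    for illustration; the source does not specify the routing criterion.
    """
    tokens, dim = values.shape
    if dim % num_groups != 0:
        raise ValueError("head_dim must be divisible by num_groups")
    if not 0 < keep_groups <= num_groups:
        raise ValueError("keep_groups must be in 1..num_groups")
    groups = values.reshape(tokens, num_groups, dim // num_groups)
    scores = np.linalg.norm(groups, axis=-1)  # (tokens, num_groups)
    # Highest-scoring groups per token, stored in ascending index order.
    top = np.argsort(-scores, axis=-1, kind="stable")[:, :keep_groups]
    idx = np.sort(top, axis=-1)
    rows = np.arange(tokens)[:, None]
    kept = groups[rows, idx]  # (tokens, keep_groups, group_size)
    return kept, idx


def restore_values(kept, idx, num_groups):
    """Rebuild full-width value vectors, zero-filling dropped groups."""
    tokens, _, group_size = kept.shape
    full = np.zeros((tokens, num_groups, group_size), dtype=kept.dtype)
    rows = np.arange(tokens)[:, None]
    full[rows, idx] = kept
    return full.reshape(tokens, num_groups * group_size)


def kv_fraction(token_keep_ratio, keep_groups, num_groups):
    """Fraction of the original KV cache retained.

    Assumes keys and values are the same size and keys are kept whole,
    so each retained token costs (1 + keep_groups / num_groups) / 2.
    """
    return token_keep_ratio * (1 + keep_groups / num_groups) / 2
```

Under these assumptions, keeping half of the tokens and half of the value groups in each retained token leaves 37.5% of the original cache. A token-only method at the same budget would have to drop further tokens to reach that figure, which is the sense in which the two axes are complementary.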

Wei Jiang, Wei Wang • 2026

Related benchmarks

Task	Dataset	Result
Language Modeling	WikiText-103 (val)	PPL 19.97	290
Multi-task Language Understanding	MMLU (test)	--	107
Long-context Variable Tracking	variable-tracking	Accuracy 79.5	16
Needle-in-the-haystack Retrieval	Needle-in-the-haystack 2500 tokens	Needle Accuracy 100	16
Language Modeling	Cross-family Language Modeling	Score 31.41	6
Language Understanding	MMLU	MMLU Score 59.39	4
