
Sub-Token Routing in LoRA for Adaptation and Query-Aware KV Compression

About

Sub-token routing provides a finer compression axis for transformer efficiency than the coarse units used in most prior work, such as tokens, pages, heads, or layers. In this paper, we study routing within a token representation itself in LoRA-adapted transformers. We consider two settings. In the query-independent setting, we combine routed subspace LoRA with value-group routing on the KV path for compression-aware language modeling. In the query-aware setting, we use a predictor-based selector to allocate a global retention budget over context-token/value-group pairs using query-conditioned relevance. Experiments show that the query-independent design improves language-model quality under reduced KV budgets, while the query-aware design preserves downstream behavior well under KV compression. We further show that sub-token routing is most effective as a complementary compression axis to token-level query-aware selection: token-level methods decide which tokens survive globally, while sub-token routing determines how the surviving tokens are compressed internally. Their combination enables deeper KV compression at nearly unchanged task accuracy.
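The query-aware setting described above allocates one global retention budget over context-token/value-group pairs. A minimal sketch of that allocation step follows, assuming the predictor has already produced a relevance score for each pair. The function name `allocate_global_budget`, the score values, and the plain top-k rule are illustrative assumptions, not the authors' selector.

```python
from typing import List


def allocate_global_budget(scores: List[List[float]], budget: int) -> List[List[bool]]:
    """Keep the `budget` highest-scoring (token, value-group) pairs.

    scores[t][g] is an assumed query-conditioned relevance score for
    value group g of context token t. Returns a boolean keep-mask of
    the same shape.
    """
    num_tokens = len(scores)
    num_groups = len(scores[0]) if scores else 0

    # Flatten to (score, token, group) triples so pairs from all tokens
    # compete for the same global budget.
    pairs = [
        (scores[t][g], t, g)
        for t in range(num_tokens)
        for g in range(num_groups)
    ]
    # Highest score first; ties go to the earlier token, then the earlier group.
    pairs.sort(key=lambda p: (-p[0], p[1], p[2]))

    keep = [[False] * num_groups for _ in range(num_tokens)]
    for _, t, g in pairs[: max(0, budget)]:
        keep[t][g] = True
    return keep


# Hypothetical scores: 3 context tokens x 4 value groups.
scores = [
    [0.9, 0.1, 0.4, 0.2],
    [0.3, 0.8, 0.7, 0.05],
    [0.0, 0.2, 0.1, 0.6],
]
mask = allocate_global_budget(scores, budget=5)
kept_per_token = [sum(row) for row in mask]
```

Because the budget is global rather than per token, tokens judged more relevant to the query can retain more value groups than others; here the three tokens keep 2, 2, and 1 groups. A token-level selector could be applied first to drop whole tokens, with this step then deciding how each surviving token is compressed internally.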

Wei Jiang, Wei Wang • 2026

Related benchmarks

Task                               Dataset                             Result               Rank
Language Modeling                  WikiText-103 (val)                  PPL 19.97            261
Multi-task Language Understanding  MMLU (test)                         --                   87
Long-context Variable Tracking     variable-tracking                   Accuracy 79.5        16
Needle-in-the-haystack Retrieval   Needle-in-the-haystack 2500 tokens  Needle Accuracy 100  16
Language Modeling                  Cross-family Language Modeling      Score 31.41          6
Language Understanding             MMLU                                MMLU Score 59.39     4