Sub-Token Routing in LoRA for Adaptation and Query-Aware KV Compression
About
Sub-token routing provides a finer compression axis for transformer efficiency than the coarse units used in most prior work, such as tokens, pages, heads, or layers. In this paper, we study routing within a token representation itself in LoRA-adapted transformers. We consider two settings. In the query-independent setting, we combine routed subspace LoRA with value-group routing on the KV path for compression-aware language modeling. In the query-aware setting, we use a predictor-based selector to allocate a global retention budget over context-token/value-group pairs using query-conditioned relevance. Experiments show that the query-independent design improves language-model quality under reduced KV budgets, while the query-aware design preserves downstream behavior well under KV compression. We further show that sub-token routing is most effective as a complementary compression axis to token-level query-aware selection: token-level methods decide which tokens survive globally, while sub-token routing determines how the surviving tokens are compressed internally. Their combination enables deeper KV compression at nearly unchanged task accuracy.
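The query-aware setting described above allocates a single retention budget across all context-token/value-group pairs rather than a fixed quota per token. The following is a minimal sketch of that allocation step only, under stated assumptions: the relevance scores are given as a plain array (the paper's predictor is not reproduced here), and the names `global_budget_mask`, `scores`, `budget`, and `values` are hypothetical. Masking with zeros only simulates compression; a real KV cache would store the retained groups compactly.

```python
import numpy as np


def global_budget_mask(scores, budget):
    """Keep the `budget` highest-scoring (token, value-group) pairs.

    scores: array of shape (num_tokens, num_groups) holding hypothetical
            query-conditioned relevance scores (assumed given).
    Returns a boolean mask of the same shape with exactly `budget` True entries.
    """
    num_tokens, num_groups = scores.shape
    budget = int(min(max(budget, 0), scores.size))
    flat_mask = np.zeros(scores.size, dtype=bool)
    if budget > 0:
        # argpartition moves the indices of the `budget` largest scores to the end.
        keep = np.argpartition(scores.ravel(), scores.size - budget)[-budget:]
        flat_mask[keep] = True
    return flat_mask.reshape(num_tokens, num_groups)


# Toy example: 3 context tokens, 2 value groups per token, group width 4.
scores = np.array([[0.90, 0.10],
                   [0.20, 0.80],
                   [0.05, 0.30]])
mask = global_budget_mask(scores, budget=3)

# Values split into groups: (num_tokens, num_groups, group_dim).
values = np.ones((3, 2, 4))
compressed = values * mask[..., None]  # dropped groups are zeroed out
```

Because the budget is global, a highly relevant token can keep all of its value groups while a marginal token keeps only one or none. A token-level selector could be composed with this sketch by first dropping whole rows of `scores` and then applying the sub-token mask to the rows that survive.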
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Language Modeling | WikiText-103 (val) | PPL 19.97 | 261 |
| Multi-task Language Understanding | MMLU (test) | -- | 87 |
| Long-context Variable Tracking | variable-tracking | Accuracy 79.5 | 16 |
| Needle-in-the-haystack Retrieval | Needle-in-the-haystack 2500 tokens | Needle Accuracy 100 | 16 |
| Language Modeling | Cross-family Language Modeling | Score 31.41 | 6 |
| Language Understanding | MMLU | MMLU Score 59.39 | 4 |