
Latent-Condensed Transformer for Efficient Long Context Modeling

About

Large language models (LLMs) face significant challenges in processing long contexts due to the linear growth of the key-value (KV) cache and the quadratic complexity of self-attention. Existing approaches address these bottlenecks separately: Multi-head Latent Attention (MLA) reduces the KV cache by projecting tokens into a low-dimensional latent space, while sparse attention reduces computation. However, sparse methods cannot operate natively on MLA's compressed latent structure, missing opportunities for joint optimization. In this paper, we propose Latent-Condensed Attention (LCA), which directly condenses context within MLA's latent space, where the representation is disentangled into semantic latent vectors and positional keys. LCA separately aggregates semantic vectors via query-aware pooling and preserves positional keys via anchor selection. This approach jointly reduces both computational cost and KV cache size without adding parameters. Beyond MLA, LCA's design is architecture-agnostic and readily extends to other attention mechanisms such as GQA. Theoretically, we prove a length-independent error bound. Experiments show LCA achieves up to 2.5× prefilling speedup and 90% KV cache reduction at 128K context while maintaining competitive performance.
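To make the two condensation paths in the abstract concrete, here is a minimal, hypothetical NumPy sketch of chunk-wise query-aware pooling over semantic latents combined with anchor selection over positional keys. The function name, shapes, chunk size, and scoring rule are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def lca_condense(latents, pos_keys, query, chunk=4):
    """Illustrative sketch (not the authors' code) of LCA's two paths:
    query-aware pooling of semantic latent vectors, and anchor selection
    of decoupled positional keys.

    latents:  (n, d_c) semantic latent vectors (MLA-style compressed KV)
    pos_keys: (n, d_r) decoupled positional keys
    query:    (d_c,)   query projected into the latent space
    Returns (n//chunk, d_c) pooled latents and (n//chunk, d_r) anchor keys.
    """
    n, d_c = latents.shape
    m = n // chunk
    lat = latents[: m * chunk].reshape(m, chunk, d_c)
    keys = pos_keys[: m * chunk].reshape(m, chunk, -1)

    # Query-aware pooling: weight each token in a chunk by its scaled
    # similarity to the query (softmax within the chunk), then average.
    scores = lat @ query / np.sqrt(d_c)                # (m, chunk)
    w = np.exp(scores - scores.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)
    pooled = (w[..., None] * lat).sum(axis=1)          # (m, d_c)

    # Anchor selection: keep the positional key of the highest-scoring
    # token in each chunk, so positional structure is preserved rather
    # than averaged away.
    idx = scores.argmax(axis=1)                        # (m,)
    anchors = keys[np.arange(m), idx]                  # (m, d_r)
    return pooled, anchors
```

With `chunk=4`, both the cached latents and the positional keys shrink 4×, which is how a single condensation step can cut compute and KV cache jointly; the paper's actual selection and pooling rules may differ.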

Zeng You, Yaofo Chen, Qiuwu Chen, Ying Sun, Shuhai Zhang, Yingjian Li, Yaowei Wang, Mingkui Tan • 2026

Related benchmarks

Task | Dataset | Metric | Result | Rank
Mathematical Reasoning | GSM-8K | Accuracy | 41.17 | 57
Long-context Language Modeling | RULER | -- | -- | 51
Multitask Language Understanding | MMLU | Accuracy | 57.04 | 34
Long-context Modeling | LongBench-E | S. QA Accuracy | 22.61 | 5
Long-context Language Understanding | LongBench-E (128K context) | Average Score | 42.05 | 2
Mathematical Reasoning | OlympiadBench Math | Accuracy | 50.13 | 2
