Latent-Condensed Transformer for Efficient Long Context Modeling
About
Large language models (LLMs) face significant challenges in processing long contexts due to the linear growth of the key-value (KV) cache and quadratic complexity of self-attention. Existing approaches address these bottlenecks separately: Multi-head Latent Attention (MLA) reduces the KV cache by projecting tokens into a low-dimensional latent space, while sparse attention reduces computation. However, sparse methods cannot operate natively on MLA's compressed latent structure, missing opportunities for joint optimization. In this paper, we propose Latent-Condensed Attention (LCA), which directly condenses context within MLA's latent space, where the representation is disentangled into semantic latent vectors and positional keys. LCA separately aggregates semantic vectors via query-aware pooling and preserves positional keys via anchor selection. This approach jointly reduces both computational cost and KV cache without adding parameters. Beyond MLA, LCA's design is architecture-agnostic and readily extends to other attention mechanisms such as GQA. Theoretically, we prove a length-independent error bound. Experiments show LCA achieves up to 2.5$\times$ prefilling speedup and 90% KV cache reduction at 128K context while maintaining competitive performance.
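The condensation step described above (query-aware pooling of semantic latent vectors plus anchor selection over positional keys) can be sketched as follows. This is a minimal illustration under my own assumptions; the function name, grouping scheme, and variable names are hypothetical and not taken from the paper:

```python
import numpy as np

def latent_condense(q, c_kv, k_rope, m):
    """Hypothetical sketch of LCA-style condensation.

    q      : (d_c,)   query projected into the latent space
    c_kv   : (n, d_c) semantic latent vectors (MLA's compressed KV)
    k_rope : (n, d_r) positional (RoPE) keys
    m      : number of condensed slots / anchors to keep
    """
    n, d_c = c_kv.shape
    # Score each latent vector against the query (scaled dot product).
    scores = c_kv @ q / np.sqrt(d_c)
    # Assumed grouping: split the context into m contiguous chunks.
    groups = np.array_split(np.arange(n), m)

    # Query-aware pooling: softmax-weighted average of each group's
    # semantic latents, so query-relevant tokens dominate the summary.
    pooled = np.stack([
        np.average(c_kv[g], axis=0,
                   weights=np.exp(scores[g] - scores[g].max()))
        for g in groups
    ])  # (m, d_c)

    # Anchor selection: keep the positional key of the highest-scoring
    # token in each group, preserving exact positional information.
    anchors = np.stack([k_rope[g[np.argmax(scores[g])]] for g in groups])

    return pooled, anchors  # condensed KV: m slots instead of n tokens
```

Note the parameter-free design: condensation uses only scores already computable from the query and the latent cache, consistent with the paper's claim of adding no parameters.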
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Mathematical Reasoning | GSM-8K | Accuracy 41.17 | 57 |
| Long-context language modeling | RULER | -- | 51 |
| Multitask Language Understanding | MMLU | Accuracy 57.04 | 34 |
| Long-context modeling | LongBench-E | S. QA Accuracy 22.61 | 5 |
| Long-context Language Understanding | LongBench-E (128K context) | Average Score 42.05 | 2 |
| Mathematical Reasoning | OlympiadBench Math | Accuracy 50.13 | 2 |