MKA: Memory-Keyed Attention for Efficient Long-Context Reasoning
About
As long-context language modeling becomes increasingly important, the cost of maintaining and attending to large Key/Value (KV) caches grows rapidly, becoming a major bottleneck in both training and inference. While prior works such as Multi-Query Attention (MQA) and Multi-Latent Attention (MLA) reduce memory by sharing or compressing KV features, they often trade off representation quality or incur runtime overhead. We propose Memory-Keyed Attention (MKA), a hierarchical attention mechanism that integrates multi-level KV caches (local, session, and long-term) and learns to route attention across them dynamically. We further introduce Route-Fused MKA (FastMKA), a broadcast-routed variant that fuses memory sources before the attention computation for improved efficiency. Experiments across a range of sequence lengths show that FastMKA achieves a favorable accuracy-efficiency trade-off: perplexity comparable to MLA with up to 5x higher training throughput and 1.8x lower evaluation latency. These results highlight MKA as a practical and extensible framework for efficient long-context attention.
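The abstract's two variants can be illustrated with a minimal single-query sketch. This is a hypothetical reconstruction, not the paper's implementation: the cache contents, the router parameterization (here a single learned projection `W_route`), and the fusion rule for FastMKA (here gate-scaling each memory's keys before a single concatenated attention pass) are all assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attend(q, K, V):
    """Scaled dot-product attention for a single query vector q (d,)."""
    d = q.shape[-1]
    w = softmax(q @ K.T / np.sqrt(d))   # (n,) attention weights
    return w @ V                        # (d,) attended output

def mka(q, caches, W_route):
    """MKA sketch: attend to each memory level (local, session,
    long-term) separately, then mix the outputs with
    query-conditioned routing weights."""
    gates = softmax(q @ W_route)        # (num_levels,) routing weights
    outs = [attend(q, K, V) for K, V in caches]
    return sum(g * o for g, o in zip(gates, outs))

def fast_mka(q, caches, W_route):
    """FastMKA sketch: fuse memory sources *before* attention by
    gate-scaling each level's keys and running one attention pass
    over the concatenated cache (one softmax instead of one per level)."""
    gates = softmax(q @ W_route)
    K = np.concatenate([g * K for g, (K, _) in zip(gates, caches)])
    V = np.concatenate([V for _, V in caches])
    return attend(q, K, V)
```

Under these assumptions, the efficiency gain of FastMKA comes from replacing `num_levels` separate attention calls with a single fused one, at the cost of entangling the routing weights with the attention logits.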
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Training Throughput Analysis | Qwen2.5-7B | Training Throughput (tokens/s) | 1.85e+3 | 28 |
| Language Modeling Inference | Qwen2.5-7B, 4K context length | Decode Latency (ms/token) | 6.2 | 4 |
| Language Modeling Inference | Qwen2.5-7B, 8K context length | Decode Latency (ms/token) | 7.1 | 4 |
| Language Modeling Inference | Qwen2.5-7B, 16K context length | Decode Latency (ms/token) | 8.4 | 4 |
| Language Modeling Inference | Qwen2.5-7B, 32K context length | Decode Latency (ms/token) | 10.3 | 4 |
| Language Modeling Inference | Qwen2.5-7B, 64K context length | Decode Latency (ms/token) | 13.6 | 4 |
| Language Modeling Inference | Qwen2.5-7B, 128K context length | Decode Latency (ms/token) | 18.4 | 4 |
| Language Modeling Inference | Qwen2.5-7B, 256K context length | Decode Latency (ms/token) | 26.3 | 4 |
| Language Modeling | Qwen2.5-7B | PPL | 3.26 | 4 |
| Long-context Reasoning | LongBench | QA Score | 45.7 | 4 |