# AllMem: A Memory-centric Recipe for Efficient Long-context Modeling

## About
Large Language Models (LLMs) encounter significant performance bottlenecks on long-sequence tasks due to the computational complexity and memory overhead inherent in the self-attention mechanism. To address these challenges, we introduce **AllMem**, a novel and efficient hybrid architecture that integrates Sliding Window Attention (SWA) with non-linear Test-Time Training (TTT) memory networks. **AllMem** enables models to scale effectively to ultra-long contexts while mitigating catastrophic forgetting. This approach not only overcomes the representational constraints typical of linear memory models but also significantly reduces the computational and memory footprint of long-sequence inference. In addition, we implement a Memory-Efficient Fine-Tuning strategy that replaces standard attention layers in pre-trained models with memory-augmented sliding window layers, allowing any off-the-shelf pre-trained LLM to be efficiently converted into an **AllMem**-based architecture. Empirical evaluations confirm that our 4k-window model achieves near-lossless performance on LongBench (average context ~37k tokens), with a marginal 0.83-point drop relative to full attention. On InfiniteBench at a 128k context, our 8k-window variant outperforms full attention, which validates the effectiveness of our parameterized memory in suppressing noise and maintaining robust long-range modeling without the prohibitive cost of global attention.
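The two building blocks described above can be sketched in a few lines of NumPy. This is an illustrative toy, not the released implementation: the exact masking scheme, the memory network's shape (a two-layer tanh MLP here), the learning rate, and the single-gradient-step write rule are all assumptions made for the sketch.

```python
import numpy as np

def sliding_window_attention(q, k, v, window):
    """Causal attention where each position attends only to the last
    `window` positions (itself included). q, k, v have shape (T, d)."""
    T, d = q.shape
    scores = q @ k.T / np.sqrt(d)
    mask = np.full((T, T), -np.inf)
    for i in range(T):
        mask[i, max(0, i - window + 1):i + 1] = 0.0  # visible band
    scores = scores + mask
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v

class TTTMemory:
    """Non-linear memory as a tiny two-layer MLP trained at test time.

    write(k, v): one SGD step on the reconstruction loss 0.5*||f(k) - v||^2
    read(q):     f(q) = tanh(q @ W1) @ W2
    Hidden size and learning rate are illustrative hyperparameters.
    """
    def __init__(self, d, hidden, lr=0.1, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = rng.normal(0.0, 0.1, (d, hidden))
        self.W2 = rng.normal(0.0, 0.1, (hidden, d))
        self.lr = lr

    def read(self, q):
        return np.tanh(q @ self.W1) @ self.W2

    def write(self, k, v):
        h = np.tanh(k @ self.W1)
        err = h @ self.W2 - v                  # dL/d(output)
        g_W2 = np.outer(h, err)
        g_h = (err @ self.W2.T) * (1.0 - h ** 2)  # backprop through tanh
        g_W1 = np.outer(k, g_h)
        self.W1 -= self.lr * g_W1
        self.W2 -= self.lr * g_W2
```

In the hybrid layer, tokens inside the window are handled by `sliding_window_attention`, while evicted key/value pairs are written into the `TTTMemory`, whose `read` output is folded back into the layer; repeated writes let the MLP compress long-range content that a linear memory could not represent.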
## Related benchmarks
| Task | Dataset | Metric | Score | Rank |
|---|---|---|---|---|
| Commonsense Reasoning | WinoGrande | Accuracy | 53.9 | 776 |
| Long-context Language Understanding | LongBench | M-Avg | 32.12 | 219 |
| Reasoning | GPQA Diamond | Accuracy | 28.79 | 88 |
| Long-context Question Answering | LongBench (test) | HotpotQA | 28.23 | 59 |
| Knowledge | ARC Challenge | ARC-C Score | 74.4 | 31 |
| Knowledge | ARC Easy | ARC-E Score | 84.7 | 31 |
| Math | MATH 500 | Accuracy | 74.4 | 25 |
| Coding | LiveCodeBench v5 | Accuracy | 25.0 | 18 |
| Long-context Question Answering | LV-Eval | -- | -- | 14 |
| General Knowledge | HellaSwag | Accuracy | 59.4 | 13 |