S$^3$-Attention: Attention-Aligned Endogenous Retrieval for Memory-Bounded Long-Context Inference
About
Large language models are increasingly applied to multi-document and long-form inputs, yet long-context inference remains memory- and noise-inefficient. Key-value (KV) caching scales linearly with context length, while external retrieval methods often return lexically similar but causally irrelevant passages. We present S3-Attention, a memory-first inference-time framework that treats long-context processing as attention-aligned endogenous retrieval. S3-Attention decodes transient key and query projections into top-k sparse feature identifiers using lightweight sparse autoencoders, and constructs a CPU-based inverted index mapping features to token positions or spans during a single streaming scan. This design allows the KV cache to be discarded entirely and bounds GPU memory usage by the scan chunk size. At generation time, feature co-activation is used to retrieve compact evidence spans, optionally fused with BM25 for exact lexical matching. Under a unified LongBench evaluation protocol with fixed prompting, decoding, and matched token budgets, S3-Hybrid closely matches full-context inference across multiple model families and improves robustness in several information-dense settings. We also report an engineering limitation of the current prototype, which incurs higher wall-clock latency than optimized full-KV baselines, motivating future kernel-level optimization.
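The core retrieval mechanism described above can be sketched in a few lines: each token position contributes its top-k sparse feature identifiers to a CPU-side inverted index, and at generation time positions are scored by how many features they co-activate with the query. The sketch below is illustrative only; the function names, the span heuristic, and the simple co-activation count are assumptions, not the paper's implementation (which decodes features with sparse autoencoders during a streaming scan and optionally fuses scores with BM25).

```python
from collections import defaultdict

def build_inverted_index(token_feature_ids):
    """Map each sparse feature id to the token positions where it fires.

    token_feature_ids: sequence whose entry t holds the top-k feature ids
    decoded from the key projection at position t (hypothetical stand-in
    for the SAE decoding step of the streaming scan).
    """
    index = defaultdict(list)
    for pos, feats in enumerate(token_feature_ids):
        for f in feats:
            index[f].append(pos)
    return index

def retrieve_spans(index, query_feature_ids, span_radius=2, top_n=3):
    """Score positions by feature co-activation with the query features,
    then return small evidence spans around the best-scoring positions.
    Ties are broken by earlier position; the radius-based span is a
    simplification of the paper's span construction."""
    scores = defaultdict(int)
    for f in query_feature_ids:
        for pos in index.get(f, ()):
            scores[pos] += 1
    best = sorted(scores, key=lambda p: (-scores[p], p))[:top_n]
    return [(max(0, p - span_radius), p + span_radius) for p in best]
```

Because the index lives on CPU and only feature ids (not key/value tensors) are kept, GPU memory is bounded by the chunk size of the streaming scan rather than the full context length.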
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Multi-hop Question Answering | HotpotQA | F1 | 47.14 | 221 |
| Long-context Language Understanding | LongBench | M-Avg | 24.87 | 219 |
| Multi-hop Question Answering | MuSiQue | -- | -- | 106 |
| Question Answering | NarrativeQA | F1 | 11.3 | 87 |
| Question Answering | Qasper | F1 | 21.87 | 61 |
| Multi-hop Question Answering | 2WikiMHQA | F1 | 17.56 | 55 |
| Summarization | MultiNews | ROUGE-L | 23.66 | 21 |
| Question Answering | MultiFieldQA-en | F1 | 43.54 | 21 |
| Summarization | GovReport | ROUGE-L | 19.55 | 21 |