Gated Slot Attention for Efficient Linear-Time Sequence Modeling
About
Linear attention Transformers and their gated variants, celebrated for enabling parallel training and efficient recurrent inference, still fall short of standard Transformers on recall-intensive tasks and demand significant resources to train from scratch. This paper introduces Gated Slot Attention (GSA), which enhances Attention with Bounded-memory Control (ABC) by incorporating a gating mechanism inspired by Gated Linear Attention (GLA). In essence, GSA comprises a two-layer GLA linked via $\operatorname{softmax}$, using context-aware memory reading and adaptive forgetting to improve memory capacity while keeping the recurrent state compact. This design greatly improves both training and inference efficiency thanks to GLA's hardware-efficient training algorithm and the reduced state size. Additionally, retaining the $\operatorname{softmax}$ operation is particularly beneficial in "finetuning pretrained Transformers to RNNs" (T2R) settings, reducing the need for extensive training from scratch. Extensive experiments confirm GSA's superior performance in scenarios requiring in-context recall and in T2R settings.
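
To make the recurrence concrete, below is a minimal NumPy sketch of GSA's recurrent (inference-time) form as described in the abstract: two bounded slot memories (keys and values) are updated with a per-slot forget gate, and the output is read through a softmax over the slots. The function name `gsa_step`, the slot count `m`, and the gate parameterization are illustrative assumptions, not the paper's reference implementation.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over the last axis.
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def gsa_step(K_slots, V_slots, q_t, k_t, v_t, alpha_t):
    """One recurrent GSA step (sketch, not the reference implementation).

    K_slots: (m, d_k) slot-key memory from the previous step
    V_slots: (m, d_v) slot-value memory from the previous step
    q_t, k_t: (d_k,) query / key for the current token
    v_t: (d_v,) value for the current token
    alpha_t: (m,) per-slot forget gate in (0, 1)
    """
    a = alpha_t[:, None]
    # Adaptive forgetting: gated update of both slot memories.
    K_slots = a * K_slots + (1.0 - a) * k_t[None, :]
    V_slots = a * V_slots + (1.0 - a) * v_t[None, :]
    # Context-aware read: softmax attention over the m slots.
    attn = softmax(K_slots @ q_t)   # (m,)
    o_t = V_slots.T @ attn          # (d_v,)
    return K_slots, V_slots, o_t

# Toy usage with m = 4 slots and d_k = d_v = 8 (arbitrary sizes).
rng = np.random.default_rng(0)
m, d = 4, 8
K_slots, V_slots = np.zeros((m, d)), np.zeros((m, d))
for _ in range(16):
    q, k, v = rng.normal(size=(3, d))
    alpha = 1.0 / (1.0 + np.exp(-rng.normal(size=m)))  # sigmoid gate, assumed form
    K_slots, V_slots, o = gsa_step(K_slots, V_slots, q, k, v, alpha)
```

The two gated updates reflect the "two-layer GLA linked via softmax" view: the first pass turns the slot-key memory into attention weights over the slots, and the second pass uses those weights to read the slot-value memory, so the state stays a fixed O(m·d) regardless of sequence length. Training would instead use GLA's chunkwise hardware-efficient form rather than this token-by-token loop.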
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Commonsense Reasoning | HellaSwag | Accuracy | 57 | 1460 |
| Commonsense Reasoning | WinoGrande | Accuracy | 72.6 | 776 |
| Question Answering | ARC Challenge | Accuracy | 46.9 | 749 |
| Commonsense Reasoning | PIQA | Accuracy | 73.5 | 647 |
| Language Modeling | WikiText | Perplexity (PPL) | 14.8 | 479 |
| Question Answering | ARC Easy | Accuracy | 76 | 386 |
| Physical Commonsense Reasoning | PIQA | Accuracy | 78.9 | 329 |
| Multitask Language Understanding | MMLU | Accuracy | 38.1 | 206 |
| Language Modeling | LAMBADA | Accuracy | 52.7 | 183 |
| Sentence Completion | HellaSwag | Accuracy | 77.9 | 133 |