
Gated Slot Attention for Efficient Linear-Time Sequence Modeling

About

Linear attention Transformers and their gated variants, celebrated for enabling parallel training and efficient recurrent inference, still fall short of traditional Transformers on recall-intensive tasks and demand significant resources when trained from scratch. This paper introduces Gated Slot Attention (GSA), which enhances Attention with Bounded-memory Control (ABC) by incorporating a gating mechanism inspired by Gated Linear Attention (GLA). Essentially, GSA comprises a two-layer GLA linked via $\operatorname{softmax}$, using context-aware memory reading and adaptive forgetting to improve memory capacity while maintaining a compact recurrent state. This design greatly enhances both training and inference efficiency through GLA's hardware-efficient training algorithm and the reduced state size. Additionally, retaining the $\operatorname{softmax}$ operation is particularly beneficial in "finetuning pretrained Transformers to RNNs" (T2R) settings, reducing the need for extensive training from scratch. Extensive experiments confirm GSA's superior performance in scenarios requiring in-context recall and in T2R settings.
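To make the "context-aware memory reading and adaptive forgetting" concrete, here is a minimal NumPy sketch of one recurrent GSA step. It assumes per-slot forget gates `alpha` in (0, 1) that interpolate each of the `m` key/value slot memories toward the current token, and a softmax read over the slots; names and shapes are illustrative, not the paper's reference implementation.

```python
import numpy as np

def softmax(x):
    # numerically stable softmax over a vector
    e = np.exp(x - x.max())
    return e / e.sum()

def gsa_step(K, V, q, k, v, alpha):
    """One illustrative recurrent step of Gated Slot Attention.

    K, V   : (m, d) slot memories for keys and values
    q, k, v: (d,) current query / key / value vectors
    alpha  : (m,) per-slot forget gate in (0, 1)
    """
    # adaptive forgetting: each slot decays by alpha and
    # absorbs the new key/value weighted by (1 - alpha)
    K = alpha[:, None] * K + (1 - alpha)[:, None] * k[None, :]
    V = alpha[:, None] * V + (1 - alpha)[:, None] * v[None, :]
    # context-aware read: softmax attention over the m slots
    attn = softmax(K @ q)        # (m,)
    o = V.T @ attn               # (d,) output
    return K, V, o
```

Because the state is only `2 * m * d` numbers regardless of sequence length, inference cost per token is constant, which is the compact-recurrent-state property the abstract refers to.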

Yu Zhang, Songlin Yang, Ruijie Zhu, Yue Zhang, Leyang Cui, Yiqiao Wang, Bolun Wang, Freda Shi, Bailin Wang, Wei Bi, Peng Zhou, Guohong Fu • 2024

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Commonsense Reasoning | HellaSwag | Accuracy | 57 | 1460 |
| Commonsense Reasoning | WinoGrande | Accuracy | 72.6 | 776 |
| Question Answering | ARC Challenge | Accuracy | 46.9 | 749 |
| Commonsense Reasoning | PIQA | Accuracy | 73.5 | 647 |
| Language Modeling | WikiText | PPL | 14.8 | 479 |
| Question Answering | ARC Easy | Accuracy | 76 | 386 |
| Physical Commonsense Reasoning | PIQA | Accuracy | 78.9 | 329 |
| Multitask Language Understanding | MMLU | Accuracy | 38.1 | 206 |
| Language Modeling | LAMBADA | Accuracy | 52.7 | 183 |
| Sentence Completion | HellaSwag | Accuracy | 77.9 | 133 |
Showing 10 of 30 rows

Other info

Code
