Gated Slot Attention for Efficient Linear-Time Sequence Modeling
About
Linear attention Transformers and their gated variants, celebrated for enabling parallel training and efficient recurrent inference, still fall short of standard Transformers on recall-intensive tasks and demand significant resources to train from scratch. This paper introduces Gated Slot Attention (GSA), which enhances Attention with Bounded-memory Control (ABC) by incorporating a gating mechanism inspired by Gated Linear Attention (GLA). In essence, GSA comprises a two-layer GLA linked via $\operatorname{softmax}$, using context-aware memory reading and adaptive forgetting to improve memory capacity while keeping the recurrent state compact. This design greatly improves both training and inference efficiency thanks to GLA's hardware-efficient training algorithm and the reduced state size. Additionally, retaining the $\operatorname{softmax}$ operation is particularly beneficial in "finetuning pretrained Transformers to RNNs" (T2R) settings, reducing the need for extensive training from scratch. Extensive experiments confirm GSA's superior performance in scenarios requiring in-context recall and in T2R settings.
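
To make the recurrence concrete, below is a minimal NumPy sketch of GSA's recurrent (inference-time) form as described in the abstract: two bounded slot memories (keys and values) are updated with a per-slot forget gate, and the output is read through a softmax over the slots. The function name `gsa_step`, the slot count `m`, and the gate parameterization are illustrative assumptions, not the paper's reference implementation.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over the last axis.
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def gsa_step(K_slots, V_slots, q_t, k_t, v_t, alpha_t):
    """One recurrent GSA step (sketch, not the reference implementation).

    K_slots: (m, d_k) slot-key memory from the previous step
    V_slots: (m, d_v) slot-value memory from the previous step
    q_t, k_t: (d_k,) query / key for the current token
    v_t: (d_v,) value for the current token
    alpha_t: (m,) per-slot forget gate in (0, 1)
    """
    a = alpha_t[:, None]
    # Adaptive forgetting: gated update of both slot memories.
    K_slots = a * K_slots + (1.0 - a) * k_t[None, :]
    V_slots = a * V_slots + (1.0 - a) * v_t[None, :]
    # Context-aware read: softmax attention over the m slots.
    attn = softmax(K_slots @ q_t)   # (m,)
    o_t = V_slots.T @ attn          # (d_v,)
    return K_slots, V_slots, o_t

# Toy usage with m = 4 slots and d_k = d_v = 8 (arbitrary sizes).
rng = np.random.default_rng(0)
m, d = 4, 8
K_slots, V_slots = np.zeros((m, d)), np.zeros((m, d))
for _ in range(16):
    q, k, v = rng.normal(size=(3, d))
    alpha = 1.0 / (1.0 + np.exp(-rng.normal(size=m)))  # sigmoid gate, assumed form
    K_slots, V_slots, o = gsa_step(K_slots, V_slots, q, k, v, alpha)
```

The two gated updates reflect the "two-layer GLA linked via softmax" view: the first pass turns the slot-key memory into attention weights over the slots, and the second pass uses those weights to read the slot-value memory, so the state stays a fixed O(m·d) regardless of sequence length. Training would instead use GLA's chunkwise hardware-efficient form rather than this token-by-token loop.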
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Commonsense Reasoning | HellaSwag | Accuracy | 57 | 1460 |
| Commonsense Reasoning | WinoGrande | Accuracy | 72.6 | 776 |
| Question Answering | ARC Challenge | Accuracy | 46.9 | 749 |
| Commonsense Reasoning | PIQA | Accuracy | 73.5 | 647 |
| Language Modeling | WikiText | Perplexity (PPL) | 14.8 | 479 |
| Question Answering | ARC Easy | Accuracy | 76 | 386 |
| Physical Commonsense Reasoning | PIQA | Accuracy | 78.9 | 329 |
| Multitask Language Understanding | MMLU | Accuracy | 38.1 | 206 |
| Language Modeling | LAMBADA | Accuracy | 52.7 | 183 |
| Sentence Completion | HellaSwag | Accuracy | 77.9 | 133 |