
GatedFWA: Linear Flash Windowed Attention with Gated Associative Memory

About

Modern autoregressive models rely on attention, yet the Softmax full attention in Transformers scales quadratically with sequence length. Sliding Window Attention (SWA) achieves linear-time encoding/decoding by constraining the attention pattern, but under an Associative Memory interpretation, its difference-style update renders the training objective effectively unbounded. In contrast, Softmax attention normalizes updates, leading to memory shrinkage and gradient vanishing. We propose GatedFWA: a Memory-Gated (Flash) Windowed Attention mechanism that preserves SWA's efficiency while stabilizing memory updates and making gradient flow controllable. In essence, GatedFWA accumulates a per-token/per-head gate into a decay bias added to the attention logits, acting as a learnable contraction in the memory recurrence. We implement a fused one-pass gate preprocessing step and a FlashAttention-compatible kernel that injects the gate under a sliding mask, ensuring I/O efficiency and numerical stability. On language-modelling benchmarks, GatedFWA delivers competitive throughput with negligible overhead and better use of global context; it also integrates cleanly with token compression/selection methods such as NSA and generalizes to other autoregressive domains.
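The core idea in the abstract can be sketched in a few lines: accumulate the per-token log-gates with a prefix sum (the one-pass preprocessing), then add the resulting decay bias to the sliding-window attention logits so older keys are progressively discounted. The reference below is a naive NumPy version for a single head; all names and shapes are illustrative assumptions, and the paper's actual implementation is a fused FlashAttention-style kernel, not a per-query loop.

```python
import numpy as np

def gated_windowed_attention(q, k, v, log_gate, window):
    """Naive single-head sketch of gated sliding-window attention.

    q, k, v:   (T, d) query/key/value rows for one head (hypothetical layout)
    log_gate:  (T,) per-token log-gates in (-inf, 0]; their running sum
               forms the decay bias added to the attention logits
    window:    sliding-window size (each query sees itself and the
               previous window - 1 keys)
    """
    T, d = q.shape
    # One-pass gate preprocessing: G[t] = sum_{s <= t} log g_s
    G = np.cumsum(log_gate)
    out = np.zeros_like(v)
    for t in range(T):
        lo = max(0, t - window + 1)
        # Standard scaled dot-product logits over the visible window
        logits = q[t] @ k[lo:t + 1].T / np.sqrt(d)
        # Decay bias G[t] - G[s] <= 0: keys further in the past are
        # penalized by the gates accumulated since they were written
        logits = logits + (G[t] - G[lo:t + 1])
        # Numerically stable softmax over the window
        w = np.exp(logits - logits.max())
        w /= w.sum()
        out[t] = w @ v[lo:t + 1]
    return out
```

With `log_gate` identically zero the bias vanishes and this reduces to plain sliding-window attention; as the gates become strongly negative, each query collapses onto its most recent tokens, which is the "learnable contraction" behaviour described above.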

Jiaxu Liu, Yuhe Bai, Xiangyu Yin, Christos-Savvas Bouganis • 2025

Related benchmarks

Task                    Dataset             Metric               Result  Rank
Commonsense Reasoning   HellaSwag           Accuracy             35.1    1891
Commonsense Reasoning   WinoGrande          Accuracy             50.77   1085
Commonsense Reasoning   PIQA                Accuracy             64.86   751
Question Answering      ARC Easy            Accuracy             47.2    597
Question Answering      BoolQ               --                   --      317
Commonsense Reasoning   COPA                Accuracy             67.2    197
Question Answering      OpenBookQA          Normalized Accuracy  30.8    102
Question Answering      ARC Challenge       Normalized Accuracy  25.52   86
Language Modeling       OpenWebText (val)   Validation Loss      2.842   80
Question Answering      SciQA               Normalized Accuracy  76.2    10
