
An Attention Free Transformer

About

We introduce the Attention Free Transformer (AFT), an efficient variant of the Transformer that eliminates the need for dot-product self-attention. In an AFT layer, the key and value are first combined with a set of learned position biases, and the result is multiplied with the query in an element-wise fashion. This new operation has memory complexity linear in both the context size and the feature dimension, making it compatible with both large inputs and large models. We also introduce AFT-local and AFT-conv, two model variants that exploit locality and spatial weight sharing while maintaining global connectivity. We conduct extensive experiments on two autoregressive modeling tasks (CIFAR10 and Enwik8) as well as an image recognition task (ImageNet-1K classification), and show that AFT achieves competitive performance on all benchmarks while providing excellent efficiency.
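The core operation described above can be sketched in a few lines of NumPy. This is a minimal illustration of the basic (AFT-full) computation as the abstract describes it — the function name, shapes, and the per-position loop are assumptions for clarity, not the authors' reference implementation:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def aft_full(Q, K, V, w):
    """Sketch of the AFT operation (hypothetical helper, not official code).

    Q, K, V: (T, d) query/key/value arrays for a sequence of length T.
    w:       (T, T) learned pairwise position biases.

    For each target position t, the keys are combined with the position
    biases w[t], the exponentiated result weights the values, and the
    normalized sum is gated element-wise by sigmoid(Q[t]). Working memory
    per position is O(T * d); no (T, T, d) attention tensor is formed.
    """
    T, d = Q.shape
    out = np.empty((T, d), dtype=float)
    for t in range(T):
        # Combine keys with this position's biases: shape (T, d).
        logits = K + w[t][:, None]
        logits = logits - logits.max(axis=0)  # stabilize the exponentials
        weights = np.exp(logits)
        # Normalized weighted sum of values, gated by the query.
        out[t] = sigmoid(Q[t]) * (weights * V).sum(axis=0) / weights.sum(axis=0)
    return out
```

With zero keys and zero biases the weights are uniform, so each output reduces to `sigmoid(Q[t])` times the mean of the values — a quick sanity check that the normalization behaves as a (content- and position-biased) weighted average.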

Shuangfei Zhai, Walter Talbott, Nitish Srivastava, Chen Huang, Hanlin Goh, Ruixiang Zhang, Josh Susskind • 2021

Related benchmarks

Task                               Dataset          Result       Rank
Time Series Forecasting            ETTh1            MSE 0.421    601
Time Series Forecasting            ETTh2            MSE 0.342    438
Time Series Forecasting            ETTm2            MSE 0.245    382
Character-level Language Modeling  enwik8 (test)    BPC 1.209    195
Time Series Forecasting            Weather          MSE 0.221    25
Character-level Language Modeling  enwik8 (train)   BPC 1.046    12
Time Series Forecasting            solar            MSE 0.198    9
Time Series Forecasting            ETTm1            MSE 0.351    9
