
An Attention Free Transformer

About

We introduce the Attention Free Transformer (AFT), an efficient variant of the Transformer that eliminates the need for dot-product self-attention. In an AFT layer, the key and value are first combined with a set of learned position biases, and the result is multiplied with the query in an element-wise fashion. This new operation has memory complexity linear in both the context size and the feature dimension, making it compatible with both large inputs and large models. We also introduce AFT-local and AFT-conv, two model variants that exploit locality and spatial weight sharing while maintaining global connectivity. We conduct extensive experiments on two autoregressive modeling tasks (CIFAR10 and Enwik8) as well as an image recognition task (ImageNet-1K classification), and show that AFT achieves competitive performance on all benchmarks while providing excellent efficiency.
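The core operation described above can be sketched in a few lines of NumPy. This is a minimal illustration of the basic (AFT-full) computation as the abstract describes it — the function name, shapes, and the per-position loop are assumptions for clarity, not the authors' reference implementation:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def aft_full(Q, K, V, w):
    """Sketch of the AFT operation (hypothetical helper, not official code).

    Q, K, V: (T, d) query/key/value arrays for a sequence of length T.
    w:       (T, T) learned pairwise position biases.

    For each target position t, the keys are combined with the position
    biases w[t], the exponentiated result weights the values, and the
    normalized sum is gated element-wise by sigmoid(Q[t]). Working memory
    per position is O(T * d); no (T, T, d) attention tensor is formed.
    """
    T, d = Q.shape
    out = np.empty((T, d), dtype=float)
    for t in range(T):
        # Combine keys with this position's biases: shape (T, d).
        logits = K + w[t][:, None]
        logits = logits - logits.max(axis=0)  # stabilize the exponentials
        weights = np.exp(logits)
        # Normalized weighted sum of values, gated by the query.
        out[t] = sigmoid(Q[t]) * (weights * V).sum(axis=0) / weights.sum(axis=0)
    return out
```

With zero keys and zero biases the weights are uniform, so each output reduces to `sigmoid(Q[t])` times the mean of the values — a quick sanity check that the normalization behaves as a (content- and position-biased) weighted average.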

Shuangfei Zhai, Walter Talbott, Nitish Srivastava, Chen Huang, Hanlin Goh, Ruixiang Zhang, Josh Susskind • 2021

Related benchmarks

Task                               Dataset          Result       Rank
Time Series Forecasting            ETTh1            MSE 0.421    601
Time Series Forecasting            ETTh2            MSE 0.342    438
Time Series Forecasting            ETTm2            MSE 0.245    382
Character-level Language Modeling  enwik8 (test)    BPC 1.209    195
Time Series Forecasting            Weather          MSE 0.221    25
Character-level Language Modeling  enwik8 (train)   BPC 1.046    12
Time Series Forecasting            solar            MSE 0.198    9
Time Series Forecasting            ETTm1            MSE 0.351    9
