
Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention

About

Transformers achieve remarkable performance in several tasks but, due to their quadratic complexity with respect to the input's length, they are prohibitively slow for very long sequences. To address this limitation, we express the self-attention as a linear dot-product of kernel feature maps and make use of the associativity property of matrix products to reduce the complexity from $\mathcal{O}\left(N^2\right)$ to $\mathcal{O}\left(N\right)$, where $N$ is the sequence length. We show that this formulation permits an iterative implementation that dramatically accelerates autoregressive transformers and reveals their relationship to recurrent neural networks. Our linear transformers achieve performance similar to vanilla transformers and are up to 4000x faster on autoregressive prediction of very long sequences.
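The two ideas in the abstract can be sketched in a few lines of NumPy: replacing the softmax with a kernel feature map $\phi$ lets the product $\phi(Q)\,(\phi(K)^\top V)$ be evaluated in $\mathcal{O}(N)$ instead of $\mathcal{O}(N^2)$, and the causal case becomes a recurrence over a running state, which is the RNN connection. This is a minimal sketch using the $\phi(x) = \mathrm{elu}(x) + 1$ feature map from the paper; function and variable names are illustrative, not the authors' code.

```python
import numpy as np

def elu_feature_map(x):
    # phi(x) = elu(x) + 1, the positive feature map used in the paper
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention(Q, K, V, eps=1e-6):
    """Non-causal linear attention. Q, K: (N, d); V: (N, d_v)."""
    Qp = elu_feature_map(Q)
    Kp = elu_feature_map(K)
    # Associativity: compute phi(K)^T V first, an O(N * d * d_v) cost,
    # instead of materialising the (N, N) attention matrix.
    KV = Kp.T @ V                     # (d, d_v)
    Z = Qp @ Kp.sum(axis=0)           # (N,) normalizer
    return (Qp @ KV) / (Z[:, None] + eps)

def causal_linear_attention(Q, K, V, eps=1e-6):
    """Iterative (RNN-like) form for autoregressive prediction."""
    Qp = elu_feature_map(Q)
    Kp = elu_feature_map(K)
    N, d = Qp.shape
    d_v = V.shape[1]
    S = np.zeros((d, d_v))            # running state: sum_j phi(k_j) v_j^T
    z = np.zeros(d)                   # running normalizer: sum_j phi(k_j)
    out = np.zeros((N, d_v))
    for i in range(N):                # one O(d * d_v) update per token
        S += np.outer(Kp[i], V[i])
        z += Kp[i]
        out[i] = (Qp[i] @ S) / (Qp[i] @ z + eps)
    return out
```

Both functions agree with the quadratic kernelized attention $\phi(Q)\phi(K)^\top V$ (with or without a causal mask, after row normalization); the linear form just reorders the matrix products, and the causal form carries the same sums forward one token at a time.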

Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, François Fleuret • 2020

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Language Modeling | WikiText-103 (test) | Perplexity | 22.2 | 524 |
| Natural Language Understanding | GLUE (dev) | SST-2 (Acc) | 91.51 | 504 |
| Natural Language Understanding | GLUE | SST-2 | 84.63 | 452 |
| Machine Translation | WMT En-De 2014 (test) | BLEU | 28.4 | 379 |
| Machine Translation | WMT En-Fr 2014 (test) | BLEU | 41.8 | 237 |
| Character-level Language Modeling | enwik8 (test) | BPC | 1.207 | 195 |
| Language Modeling | WikiText-103 (val) | PPL | 27.44 | 180 |
| Long-range sequence modeling | Long Range Arena (LRA) | Text Accuracy | 65.9 | 164 |
| Long-range sequence modeling | Long Range Arena (LRA) (test) | Accuracy (Avg) | 50.5 | 158 |
| Density Estimation | CIFAR-10 (test) | Bits/dim | 3.4 | 134 |

Showing 10 of 94 rows
