
Reformer: The Efficient Transformer

About

Large Transformer models routinely achieve state-of-the-art results on a number of tasks, but training these models can be prohibitively costly, especially on long sequences. We introduce two techniques to improve the efficiency of Transformers. First, we replace dot-product attention with one that uses locality-sensitive hashing, changing its complexity from O($L^2$) to O($L\log L$), where $L$ is the length of the sequence. Second, we use reversible residual layers instead of the standard residuals, which allows storing activations only once in the training process instead of $N$ times, where $N$ is the number of layers. The resulting model, the Reformer, performs on par with Transformer models while being much more memory-efficient and much faster on long sequences.
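The memory saving from reversible residuals can be sketched in a few lines. This is a toy illustration, not the paper's implementation: `f` and `g` are scalar placeholders standing in for the attention and feed-forward sublayers, and the function names are invented for this example.

```python
import math

def f(x):
    return math.tanh(x)  # placeholder for the attention sublayer

def g(x):
    return 0.5 * x       # placeholder for the feed-forward sublayer

def rev_forward(x1, x2):
    # Reversible coupling: the activation is split into two halves,
    #   y1 = x1 + f(x2),  y2 = x2 + g(y1)
    y1 = x1 + f(x2)
    y2 = x2 + g(y1)
    return y1, y2

def rev_inverse(y1, y2):
    # The inputs are recomputed from the outputs, so intermediate
    # activations need not be cached for backprop:
    #   x2 = y2 - g(y1),  x1 = y1 - f(x2)
    x2 = y2 - g(y1)
    x1 = y1 - f(x2)
    return x1, x2

y1, y2 = rev_forward(0.3, -1.2)
x1, x2 = rev_inverse(y1, y2)  # recovers the inputs 0.3 and -1.2
```

With a standard residual `y = x + f(x)`, each layer's input must be stored for the backward pass, giving activation memory proportional to the depth $N$; with the coupling above, each layer's inputs are reconstructed from its outputs, so activations are stored only once.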

Nikita Kitaev, Łukasz Kaiser, Anselm Levskaya • 2020

Related benchmarks

Task | Dataset | Result | Rank
Image Classification | CIFAR-100 (test) | Accuracy: 73.02 | 3518
Image Classification | CIFAR-10 (test) | Accuracy: 90.58 | 3381
Multivariate Forecasting | ETTh1 | MSE: 0.686 | 645
Time Series Forecasting | ETTh1 | MSE: 0.837 | 601
Language Modeling | WikiText-103 (test) | Perplexity: 26 | 524
Natural Language Understanding | GLUE | SST-2: 50.92 | 452
Time Series Forecasting | ETTh2 | MSE: 3.527 | 438
Multivariate Time-series Forecasting | ETTm1 | MSE: 0.538 | 433
Time Series Forecasting | ETTm2 | MSE: 3.581 | 382
Machine Translation | WMT En-De 2014 (test) | BLEU: 29.1 | 379
Showing 10 of 262 rows
...

Other info

Code
