Reformer: The Efficient Transformer
About
Large Transformer models routinely achieve state-of-the-art results on a number of tasks, but training these models can be prohibitively costly, especially on long sequences. We introduce two techniques to improve the efficiency of Transformers. First, we replace dot-product attention with one based on locality-sensitive hashing, reducing its complexity from $O(L^2)$ to $O(L \log L)$, where $L$ is the length of the sequence. Furthermore, we use reversible residual layers instead of the standard residuals, which allows storing activations only once during training instead of $N$ times, where $N$ is the number of layers. The resulting model, the Reformer, performs on par with Transformer models while being much more memory-efficient and much faster on long sequences.
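The two techniques above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation: `lsh_buckets` shows the angular LSH scheme (random rotations, then argmax over the concatenated positive and negative projections) used to group similar queries so attention only needs to be computed within a bucket, and `rev_forward`/`rev_inverse` show why reversible residual layers let activations be recomputed from the layer output instead of stored. The function names and the choice of `f`/`g` are illustrative assumptions.

```python
import numpy as np

def lsh_buckets(x, n_buckets, rng):
    """Assign each position to an LSH bucket via angular LSH.

    x: (seq_len, d_model) array of query/key vectors.
    Projects onto n_buckets // 2 random directions and takes the
    argmax over [xR; -xR], so similar vectors tend to share a bucket.
    Attention is then restricted to positions in the same bucket,
    giving the O(L log L) cost after sorting by bucket id.
    """
    d = x.shape[-1]
    R = rng.standard_normal((d, n_buckets // 2))  # random rotation
    rotated = x @ R                               # (seq_len, n_buckets // 2)
    scores = np.concatenate([rotated, -rotated], axis=-1)
    return scores.argmax(axis=-1)                 # bucket id per position

def rev_forward(x1, x2, f, g):
    """Reversible residual block: y1 = x1 + f(x2), y2 = x2 + g(y1).

    f and g stand in for the attention and feed-forward sublayers.
    """
    y1 = x1 + f(x2)
    y2 = x2 + g(y1)
    return y1, y2

def rev_inverse(y1, y2, f, g):
    """Recover the block's inputs from its outputs.

    Because the block is invertible, the backward pass can recompute
    activations layer by layer instead of storing all N of them.
    """
    x2 = y2 - g(y1)
    x1 = y1 - f(x2)
    return x1, x2
```

A quick check of reversibility: running `rev_forward` with, say, `f = np.tanh` and `g = lambda t: 0.5 * t`, then `rev_inverse` on its outputs, returns the original `x1, x2` up to floating-point error, which is what makes the single-copy activation storage possible.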
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Image Classification | CIFAR-100 (test) | Accuracy | 73.02 | 3518 |
| Image Classification | CIFAR-10 (test) | Accuracy | 90.58 | 3381 |
| Multivariate Forecasting | ETTh1 | MSE | 0.686 | 645 |
| Time Series Forecasting | ETTh1 | MSE | 0.837 | 601 |
| Language Modeling | WikiText-103 (test) | Perplexity | 26 | 524 |
| Natural Language Understanding | GLUE | SST-2 | 50.92 | 452 |
| Time Series Forecasting | ETTh2 | MSE | 3.527 | 438 |
| Multivariate Time-series Forecasting | ETTm1 | MSE | 0.538 | 433 |
| Time Series Forecasting | ETTm2 | MSE | 3.581 | 382 |
| Machine Translation | WMT En-De 2014 (test) | BLEU | 29.1 | 379 |