Reformer: The Efficient Transformer
About
Large Transformer models routinely achieve state-of-the-art results on a number of tasks, but training these models can be prohibitively costly, especially on long sequences. We introduce two techniques to improve the efficiency of Transformers. First, we replace dot-product attention with one based on locality-sensitive hashing, reducing its complexity from $O(L^2)$ to $O(L \log L)$, where $L$ is the length of the sequence. Furthermore, we use reversible residual layers instead of the standard residuals, which allows storing activations only once during training instead of $N$ times, where $N$ is the number of layers. The resulting model, the Reformer, performs on par with Transformer models while being much more memory-efficient and much faster on long sequences.
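The two techniques above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation: `lsh_buckets` shows the angular LSH scheme (random rotations, then argmax over the concatenated positive and negative projections) used to group similar queries so attention only needs to be computed within a bucket, and `rev_forward`/`rev_inverse` show why reversible residual layers let activations be recomputed from the layer output instead of stored. The function names and the choice of `f`/`g` are illustrative assumptions.

```python
import numpy as np

def lsh_buckets(x, n_buckets, rng):
    """Assign each position to an LSH bucket via angular LSH.

    x: (seq_len, d_model) array of query/key vectors.
    Projects onto n_buckets // 2 random directions and takes the
    argmax over [xR; -xR], so similar vectors tend to share a bucket.
    Attention is then restricted to positions in the same bucket,
    giving the O(L log L) cost after sorting by bucket id.
    """
    d = x.shape[-1]
    R = rng.standard_normal((d, n_buckets // 2))  # random rotation
    rotated = x @ R                               # (seq_len, n_buckets // 2)
    scores = np.concatenate([rotated, -rotated], axis=-1)
    return scores.argmax(axis=-1)                 # bucket id per position

def rev_forward(x1, x2, f, g):
    """Reversible residual block: y1 = x1 + f(x2), y2 = x2 + g(y1).

    f and g stand in for the attention and feed-forward sublayers.
    """
    y1 = x1 + f(x2)
    y2 = x2 + g(y1)
    return y1, y2

def rev_inverse(y1, y2, f, g):
    """Recover the block's inputs from its outputs.

    Because the block is invertible, the backward pass can recompute
    activations layer by layer instead of storing all N of them.
    """
    x2 = y2 - g(y1)
    x1 = y1 - f(x2)
    return x1, x2
```

A quick check of reversibility: running `rev_forward` with, say, `f = np.tanh` and `g = lambda t: 0.5 * t`, then `rev_inverse` on its outputs, returns the original `x1, x2` up to floating-point error, which is what makes the single-copy activation storage possible.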
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Image Classification | CIFAR-100 (test) | Accuracy | 73.02 | 3518 |
| Image Classification | CIFAR-10 (test) | Accuracy | 90.58 | 3381 |
| Multivariate Forecasting | ETTh1 | MSE | 0.686 | 645 |
| Time Series Forecasting | ETTh1 | MSE | 0.837 | 601 |
| Language Modeling | WikiText-103 (test) | Perplexity | 26 | 524 |
| Natural Language Understanding | GLUE | SST-2 | 50.92 | 452 |
| Time Series Forecasting | ETTh2 | MSE | 3.527 | 438 |
| Multivariate Time-series Forecasting | ETTm1 | MSE | 0.538 | 433 |
| Time Series Forecasting | ETTm2 | MSE | 3.581 | 382 |
| Machine Translation | WMT En-De 2014 (test) | BLEU | 29.1 | 379 |