Linformer: Self-Attention with Linear Complexity
About
Large transformer models have shown extraordinary success in achieving state-of-the-art results in many natural language processing applications. However, training and deploying these models can be prohibitively costly for long sequences, as the standard self-attention mechanism of the Transformer uses $O(n^2)$ time and space with respect to sequence length. In this paper, we demonstrate that the self-attention mechanism can be approximated by a low-rank matrix. We further exploit this finding to propose a new self-attention mechanism, which reduces the overall self-attention complexity from $O(n^2)$ to $O(n)$ in both time and space. The resulting linear transformer, the *Linformer*, performs on par with standard Transformer models, while being much more memory- and time-efficient.
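The abstract's core idea can be sketched concretely: instead of forming the full $n \times n$ attention matrix, Linformer projects the keys and values along the sequence axis from length $n$ down to a fixed $k \ll n$, so attention costs $O(nk)$. The sketch below is a minimal single-head NumPy illustration, not the paper's implementation; the random projection matrices `E` and `F` stand in for the learned projections described in the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def linformer_attention(Q, K, V, E, F):
    """Single-head Linformer-style attention.

    Q, K, V: (n, d) query/key/value matrices.
    E, F:    (k, n) projections that shrink the sequence axis,
             learned in the paper but random in this sketch.
    """
    K_proj = E @ K  # (k, d): keys compressed from n rows to k
    V_proj = F @ V  # (k, d): values compressed likewise
    # Attention scores are (n, k) rather than (n, n).
    scores = Q @ K_proj.T / np.sqrt(Q.shape[-1])
    return softmax(scores) @ V_proj  # (n, d)

n, d, k = 512, 64, 32  # hypothetical sizes for illustration
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
E = rng.standard_normal((k, n)) / np.sqrt(n)
F = rng.standard_normal((k, n)) / np.sqrt(n)

out = linformer_attention(Q, K, V, E, F)
print(out.shape)  # (512, 64)
```

The output retains the full sequence length $n$, but no intermediate tensor ever has an $n \times n$ shape, which is the source of the linear time and memory claim.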
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Image Classification | CIFAR-100 (test) | Accuracy | 70.87 | 3518 |
| Image Classification | CIFAR-10 (test) | Accuracy | 92.45 | 3381 |
| Image Classification | ImageNet-1K 1.0 (val) | Top-1 Accuracy | 78.7 | 1866 |
| Image Classification | ImageNet (val) | Top-1 Accuracy | 77.6 | 1206 |
| Language Modeling | PTB | Perplexity | 48.9 | 650 |
| Language Modeling | WikiText-103 (test) | Perplexity | 26.1 | 524 |
| Natural Language Understanding | GLUE (dev) | SST-2 (Acc) | 93.1 | 504 |
| Language Modeling | PTB (test) | Perplexity | 48.9 | 471 |
| Image Classification | CIFAR-10 | -- | -- | 471 |
| Long-range sequence modeling | Long Range Arena (LRA) | Text Accuracy | 57.29 | 164 |