Rethinking Attention with Performers
About
We introduce Performers, Transformer architectures which can estimate regular (softmax) full-rank-attention Transformers with provable accuracy, but using only linear (as opposed to quadratic) space and time complexity, without relying on any priors such as sparsity or low-rankness. To approximate softmax attention kernels, Performers use a novel Fast Attention Via positive Orthogonal Random features approach (FAVOR+), which may be of independent interest for scalable kernel methods. FAVOR+ can also be used to efficiently model kernelizable attention mechanisms beyond softmax. This representational power is crucial to accurately compare softmax with other kernels for the first time on large-scale tasks, beyond the reach of regular Transformers, and to investigate optimal attention kernels. Performers are linear architectures fully compatible with regular Transformers and with strong theoretical guarantees: unbiased or nearly-unbiased estimation of the attention matrix, uniform convergence, and low estimation variance. We tested Performers on a rich set of tasks stretching from pixel prediction through text models to protein sequence modeling. We demonstrate competitive results with other examined efficient sparse and dense attention methods, showcasing the effectiveness of the novel attention-learning paradigm leveraged by Performers.
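The core idea behind FAVOR+ can be illustrated in a few lines of NumPy: map queries and keys through positive random features (built from orthogonal Gaussian draws) so that the softmax kernel is approximated in expectation, then reassociate the matrix products so attention costs O(Lmd) instead of O(L²d). The sketch below is a minimal, single-head illustration, not the paper's full implementation; the function names (`orthogonal_gaussian`, `positive_features`, `favor_attention`) are ours.

```python
import numpy as np

def orthogonal_gaussian(m, d, rng):
    """Draw m random feature vectors in R^d whose rows are orthogonal
    within each block of d, rescaled so their norms match i.i.d. Gaussians."""
    blocks = []
    for _ in range(-(-m // d)):  # ceil(m / d) blocks
        q, _ = np.linalg.qr(rng.standard_normal((d, d)))
        blocks.append(q)
    w = np.concatenate(blocks, axis=0)[:m]
    # QR rows are unit-norm; re-scale to chi-distributed norms like Gaussians.
    norms = np.sqrt(rng.chisquare(d, size=m))
    return w * norms[:, None]

def positive_features(x, w):
    """Positive random features: phi(x) = exp(w.x - ||x||^2 / 2) / sqrt(m),
    so E[phi(q).phi(k)] = exp(q.k), the (unnormalized) softmax kernel."""
    m = w.shape[0]
    return np.exp(x @ w.T - np.sum(x**2, axis=-1, keepdims=True) / 2) / np.sqrt(m)

def favor_attention(q, k, v, w):
    """Linear-time attention: compute phi(Q) (phi(K)^T V) and normalize by
    phi(Q) (phi(K)^T 1), never forming the L x L attention matrix."""
    d = q.shape[-1]
    qp = positive_features(q / d**0.25, w)  # 1/d^(1/4) scaling absorbs 1/sqrt(d)
    kp = positive_features(k / d**0.25, w)
    num = qp @ (kp.T @ v)                    # O(L m d) instead of O(L^2 d)
    den = qp @ kp.sum(axis=0, keepdims=True).T
    return num / den
```

With enough features m, the result closely matches exact softmax attention, while all feature values stay strictly positive, which is what keeps the estimator's variance low near small kernel values.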
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Image Classification | CIFAR-100 (test) | Accuracy | 73.11 | 3518 |
| Image Classification | CIFAR-10 (test) | Accuracy | 91.58 | 3381 |
| Image Classification | ImageNet-1k (val) | Top-1 Accuracy | 79.5 | 840 |
| Language Modeling | PTB | Perplexity | 49.1 | 650 |
| Language Modeling | WikiText-103 (test) | Perplexity | 26.8 | 524 |
| Natural Language Understanding | GLUE (dev) | SST-2 (Acc) | 91.97 | 504 |
| Natural Language Understanding | GLUE | SST-2 (Acc) | 83.8 | 452 |
| Character-level Language Modeling | enwik8 (test) | BPC | 1.199 | 195 |
| Language Modeling | WikiText-103 (val) | Perplexity | 62.5 | 180 |
| Long-range sequence modeling | Long Range Arena (LRA) | Text Accuracy | 65.4 | 164 |