SwitchHead: Accelerating Transformers with Mixture-of-Experts Attention
About
Despite many recent works on Mixture of Experts (MoE) for resource-efficient Transformer language models, existing methods mostly focus on MoE for the feedforward layers. Previous attempts at extending MoE to the self-attention layer failed to match the performance of a parameter-matched baseline. Our SwitchHead is an effective MoE method for the attention layer that reduces both the compute and memory requirements, achieving wall-clock speedup while matching the language modeling performance of the baseline Transformer. Its MoE mechanism allows SwitchHead to compute up to 8 times fewer attention matrices than the standard Transformer. SwitchHead can also be combined with MoE feedforward layers, resulting in fully-MoE "SwitchAll" Transformers. For our 262M-parameter model trained on C4, SwitchHead matches the perplexity of standard models with only 44% of the compute and 27% of the memory usage. Zero-shot experiments on downstream tasks confirm the performance of SwitchHead, e.g., achieving more than 3.5% absolute improvement on BLiMP compared to the baseline with equal compute.
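To illustrate how an MoE attention layer can compute fewer attention matrices than a dense multi-head baseline, here is a minimal, hedged sketch in numpy. It is not the authors' implementation: the exact routing and projection layout of SwitchHead differ (the paper uses non-competitive sigmoid gating and shares queries/keys while routing value/output projections). All names (`switchhead_attention_sketch`, `Wv_experts`, `Wo_experts`, `Wg`) and shapes below are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def switchhead_attention_sketch(x, Wq, Wk, Wv_experts, Wo_experts, Wg, k=1):
    """Illustrative MoE attention in the spirit of SwitchHead.

    A single attention matrix is computed from shared query/key
    projections; per token, a sigmoid gate selects the top-k value/
    output expert pairs, so only k expert projections are evaluated
    instead of one per dense head. Shapes (assumed for this sketch):
      x: (T, d), Wq/Wk: (d, d_head),
      Wv_experts: (E, d, d_head), Wo_experts: (E, d_head, d), Wg: (d, E).
    """
    T, d = x.shape
    # Non-competitive sigmoid gating over E experts, top-k selection.
    gate = 1.0 / (1.0 + np.exp(-x @ Wg))            # (T, E)
    topk = np.argsort(-gate, axis=-1)[:, :k]        # (T, k) expert indices
    # Shared attention matrix: computed once, not once per expert.
    q, key = x @ Wq, x @ Wk                         # (T, d_head)
    att = softmax(q @ key.T / np.sqrt(q.shape[-1])) # (T, T)
    out = np.zeros((T, Wo_experts.shape[-1]))
    for t in range(T):
        for e in topk[t]:
            v = x @ Wv_experts[e]                   # (T, d_head)
            # Gate-weighted expert output for this token.
            out[t] += gate[t, e] * (att[t] @ v) @ Wo_experts[e]
    return out
```

With k much smaller than the number of dense heads it replaces, the attention matrix count drops accordingly, which is where the up-to-8x reduction in the abstract comes from.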
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Language Modeling | C4 | Perplexity | 16.45 | 1182 |
| Language Modeling | WikiText-103 (test) | Perplexity | 9.81 | 524 |
| Language Modeling | C4 (test) | Perplexity | 15.43 | 268 |
| Language Modeling | WikiText-103 | Perplexity | 9.55 | 146 |
| Language Modeling | LAMBADA zero-shot (test) | Accuracy (zero-shot) | 30.2 | 44 |
| Zero-shot Language Modeling | BLiMP (test) | Accuracy | 79.6 | 8 |
| Zero-shot Language Modeling | CBT (test) | Accuracy | 84.2 | 4 |
| Language Modeling | peS2o | Perplexity | 9.86 | 4 |