
SwitchHead: Accelerating Transformers with Mixture-of-Experts Attention

About

Despite many recent works on Mixture-of-Experts (MoE) methods for resource-efficient Transformer language models, existing approaches mostly focus on MoE for the feedforward layers. Previous attempts at extending MoE to the self-attention layer fail to match the performance of the parameter-matched baseline. We introduce SwitchHead, an effective MoE method for the attention layer that reduces both compute and memory requirements and achieves wall-clock speedup, while matching the language modeling performance of the baseline Transformer. Its MoE mechanism lets SwitchHead compute up to 8 times fewer attention matrices than the standard Transformer. SwitchHead can also be combined with MoE feedforward layers, yielding fully-MoE "SwitchAll" Transformers. For our 262M-parameter model trained on C4, SwitchHead matches the perplexity of standard models with only 44% of the compute and 27% of the memory. Zero-shot experiments on downstream tasks confirm these results: for example, SwitchHead achieves a more than 3.5% absolute improvement on BLiMP over the baseline under equal compute.
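The abstract describes routing attention projections through a small set of experts so that fewer full attention matrices are computed. The following is a minimal, hypothetical sketch of such a mechanism for a single head, assuming a sigmoid (non-competitive) top-k router over expert value and output projections; the function and parameter names are illustrative, not the paper's actual implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def topk_gates(scores, k):
    """Keep the k largest sigmoid router scores per token, zero the rest."""
    idx = np.argsort(-scores, axis=-1)[:, :k]
    mask = np.zeros_like(scores)
    np.put_along_axis(mask, idx, 1.0, axis=-1)
    return scores * mask

def moe_attention_head(x, Wq, Wk, Wv_experts, Wo_experts, Wsel_v, Wsel_o, k=2):
    """One MoE-attention head (illustrative sketch).

    Queries and keys use ordinary dense projections, so only ONE
    attention matrix is computed for the head. The value and output
    projections are mixtures over E experts, selected per token by
    sigmoid routers Wsel_v / Wsel_o.
    x: (T, d);  Wq, Wk: (d, dh);  W*_experts: (E, d, dh) / (E, dh, d).
    """
    q = x @ Wq                                   # (T, dh)
    k_ = x @ Wk                                  # (T, dh)
    gates_v = topk_gates(sigmoid(x @ Wsel_v), k)  # (T, E)
    # Mixture value projection: gate-weighted sum of expert projections.
    v = np.einsum('te,td,edh->th', gates_v, x, Wv_experts)
    att = softmax(q @ k_.T / np.sqrt(q.shape[-1]))  # single (T, T) matrix
    ctx = att @ v                                # (T, dh)
    gates_o = topk_gates(sigmoid(x @ Wsel_o), k)  # (T, E)
    return np.einsum('te,th,ehd->td', gates_o, ctx, Wo_experts)
```

Because routing happens only in the value and output projections, adding experts grows capacity without adding attention matrices, which is where the compute and memory savings in the abstract come from.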

Róbert Csordás, Piotr Piękos, Kazuki Irie, Jürgen Schmidhuber • 2023

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Language Modeling | C4 | Perplexity | 16.45 | 1182 |
| Language Modeling | WikiText-103 (test) | Perplexity | 9.81 | 524 |
| Language Modeling | C4 (test) | Perplexity | 15.43 | 268 |
| Language Modeling | WikiText-103 | Perplexity | 9.55 | 146 |
| Language Modeling | LAMBADA zero-shot (test) | Accuracy (zero-shot) | 30.2 | 44 |
| Zero-shot Language Modeling | BLiMP (test) | Accuracy | 79.6 | 8 |
| Zero-shot Language Modeling | CBT (test) | Accuracy | 84.2 | 4 |
| Language Modeling | peS2o | Perplexity | 9.86 | 4 |

Other info

Code
