
Are Sixteen Heads Really Better than One?

About

Attention is a powerful and ubiquitous mechanism for allowing neural models to focus on particular salient pieces of information by taking their weighted average when making predictions. In particular, multi-headed attention is a driving force behind many recent state-of-the-art NLP models such as Transformer-based MT models and BERT. These models apply multiple attention mechanisms in parallel, with each attention "head" potentially focusing on different parts of the input, which makes it possible to express sophisticated functions beyond the simple weighted average. In this paper we make the surprising observation that even if models have been trained using multiple heads, in practice, a large percentage of attention heads can be removed at test time without significantly impacting performance. In fact, some layers can even be reduced to a single head. We further examine greedy algorithms for pruning down models, and the potential speed, memory efficiency, and accuracy improvements obtainable therefrom. Finally, we analyze the results with respect to which parts of the model are more reliant on having multiple heads, and provide precursory evidence that training dynamics play a role in the gains provided by multi-head attention.
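To make the two core ideas in the abstract concrete, here is a minimal PyTorch sketch (not the authors' released code) of (1) a multi-head self-attention layer whose individual heads can be switched off with a per-head mask at test time, and (2) a greedy pruning loop that repeatedly removes the head whose removal hurts a held-out score the least. The names `MaskableMultiHeadAttention`, `greedy_prune`, and the `validation_score` callable are illustrative assumptions, not identifiers from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MaskableMultiHeadAttention(nn.Module):
    """Standard multi-head self-attention with a per-head on/off mask."""

    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.out = nn.Linear(d_model, d_model)
        # 1.0 = head active, 0.0 = head pruned; a buffer, not a learned parameter.
        self.register_buffer("head_mask", torch.ones(n_heads))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # Reshape each projection to (batch, heads, time, d_head).
        q, k, v = (t.view(B, T, self.n_heads, self.d_head).transpose(1, 2)
                   for t in (q, k, v))
        attn = F.softmax(q @ k.transpose(-2, -1) / self.d_head ** 0.5, dim=-1)
        ctx = attn @ v                                   # (B, H, T, d_head)
        # Zero out the contribution of pruned heads before the output projection.
        ctx = ctx * self.head_mask.view(1, -1, 1, 1)
        ctx = ctx.transpose(1, 2).reshape(B, T, -1)
        return self.out(ctx)


def greedy_prune(layer: MaskableMultiHeadAttention, validation_score, budget: int):
    """Greedily disable `budget` heads in one layer, each time committing the
    removal that costs the least validation score. `validation_score` is an
    assumed callable that evaluates the full model on held-out data."""
    for _ in range(budget):
        best_head, best_score = None, float("-inf")
        for h in range(layer.n_heads):
            if layer.head_mask[h] == 0:
                continue  # already pruned
            layer.head_mask[h] = 0.0          # trial removal
            score = validation_score()
            layer.head_mask[h] = 1.0          # restore
            if score > best_score:
                best_head, best_score = h, score
        if best_head is None:
            break                             # nothing left to prune
        layer.head_mask[best_head] = 0.0      # commit the cheapest removal
```

In this sketch pruning is simulated by masking a head's output to zero rather than physically removing its parameters; the paper's point is that many heads can be masked this way at test time with little loss, and some layers still work with only one head active.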

Paul Michel, Omer Levy, Graham Neubig • 2019

Related benchmarks

Task | Dataset | Metric | Result | Rank
Image Classification | ImageNet-1K 1.0 (val) | Top-1 Accuracy | 80.1 | 40
Natural Language Understanding | GLUE | CoLA Score | 45.7 | 15
Circuit Discovery | IOI | AUC | 83.6 | 12
Circuit Discovery | Docstring | AUC | 0.889 | 12
Circuit Discovery | Greater-than | AUC | 0.706 | 12
Circuit Discovery | Docstring | KL Divergence | 0.805 | 6
Circuit Discovery | Greater-than | KL Divergence | 0.642 | 6
Circuit Discovery | IOI | KL Divergence | 0.668 | 6
Circuit Discovery | Tracr-Proportion | Loss | 0.679 | 6
Circuit Discovery | Tracr-Reverse | Loss | 0.577 | 6

(Showing 10 of 11 rows)
