
Cascaded Head-colliding Attention

About

Transformers have advanced natural language processing (NLP) on a variety of important tasks. The cornerstone of the Transformer architecture is the multi-head attention (MHA) mechanism, which models pairwise interactions between the elements of a sequence. Despite its massive success, the current framework ignores interactions among different heads, so many heads are redundant in practice, which wastes much of the model's capacity. To improve parameter efficiency, we reformulate MHA as a latent variable model from a probabilistic perspective. We present cascaded head-colliding attention (CODA), which explicitly models the interactions between attention heads through a hierarchical variational distribution. We conduct extensive experiments and demonstrate that CODA outperforms the Transformer baseline by 0.6 perplexity on Wikitext-103 in language modeling and by 0.6 BLEU on WMT14 EN-DE in machine translation, owing to its improved parameter efficiency. Our implementation is publicly available at https://github.com/LZhengisme/CODA.

Lin Zheng, Zhiyong Wu, Lingpeng Kong • 2021
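
To make the abstract's starting point concrete, here is a minimal sketch of standard multi-head attention in plain Python. It is not the CODA method and is not taken from the authors' repository. The function names, toy dimensions, and random weights are all illustrative assumptions, and the final output projection is omitted. The point it illustrates is that each head computes its attention weights without seeing any other head, which is the independence the abstract identifies as a source of redundancy.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def attention_head(x, w_q, w_k, w_v):
    """Scaled dot-product attention for one head.

    Returns (weights, output); weights[i][j] is how much position i attends to j.
    """
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    scale = math.sqrt(len(q[0]))
    k_t = [list(col) for col in zip(*k)]  # transpose of k
    scores = matmul(q, k_t)
    weights = [softmax([s / scale for s in row]) for row in scores]
    return weights, matmul(weights, v)


def multi_head_attention(x, heads):
    """Run every head independently, then concatenate outputs per position.

    No information flows between heads here (output projection omitted).
    """
    results = [attention_head(x, *params) for params in heads]
    all_weights = [w for w, _ in results]
    concat = [sum((out[i] for _, out in results), []) for i in range(len(x))]
    return all_weights, concat


def random_matrix(rng, rows, cols):
    """Hypothetical helper: Gaussian-initialized matrix for the toy example."""
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


# Toy sizes chosen for illustration only.
rng = random.Random(0)
seq_len, d_model, n_heads, d_head = 4, 8, 2, 4
x = random_matrix(rng, seq_len, d_model)
heads = [
    tuple(random_matrix(rng, d_model, d_head) for _ in range(3))  # (W_q, W_k, W_v)
    for _ in range(n_heads)
]
all_weights, output = multi_head_attention(x, heads)
```

Here `all_weights` holds one `seq_len` × `seq_len` attention matrix per head, each produced in isolation. According to the abstract, CODA instead treats these per-head attention distributions as latent variables and couples them through a hierarchical variational distribution; that coupling is not implemented in this sketch.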

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Language Modeling | WikiText-103 (test) | Perplexity 18.48 | 524 |
| Language Modeling | WikiText-103 (val) | Perplexity 17.81 | 180 |
| Machine Translation | WMT En-De '14 | BLEU 28 | 89 |
| Machine Translation | IWSLT14 DE-EN | BLEU 35.6 | 22 |
