Multi-branch Attentive Transformer

About

While the multi-branch architecture is one of the key ingredients to the success of computer vision tasks, it has not been well investigated in natural language processing, especially sequence learning tasks. In this work, we propose a simple yet effective variant of Transformer called multi-branch attentive Transformer (briefly, MAT), where the attention layer is the average of multiple branches and each branch is an independent multi-head attention layer. We leverage two training techniques to regularize the training: drop-branch, which randomly drops individual branches during training, and proximal initialization, which uses a pre-trained Transformer model to initialize multiple branches. Experiments on machine translation, code generation and natural language understanding demonstrate that such a simple variant of Transformer brings significant improvements. Our code is available at https://github.com/HA-Transformer.
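
The branch averaging and drop-branch described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: each branch is a placeholder callable standing in for an independent multi-head attention layer, and the names `multi_branch_average` and `drop_rate`, as well as the rescaling of surviving branches by 1 / (1 - drop_rate), are assumptions borrowed from standard dropout practice, since the abstract does not spell them out.

```python
import random


def multi_branch_average(branches, x, drop_rate=0.0, training=False, rng=random):
    """Average the outputs of several branches, optionally with drop-branch.

    `branches` is a list of callables (hypothetical stand-ins for multi-head
    attention layers); each maps the input `x` to a list of floats.
    """
    if not 0.0 <= drop_rate < 1.0:
        raise ValueError("drop_rate must be in [0, 1)")

    outputs = [branch(x) for branch in branches]
    n_branches = len(outputs)
    dim = len(outputs[0])

    if training and drop_rate > 0.0:
        # Assumption: each branch is dropped independently, and kept branches
        # are rescaled so the expected output matches the plain average.
        keep_scale = 1.0 / (1.0 - drop_rate)
        weights = [keep_scale if rng.random() >= drop_rate else 0.0
                   for _ in outputs]
    else:
        # At inference time every branch contributes equally.
        weights = [1.0] * n_branches

    return [
        sum(w * out[j] for w, out in zip(weights, outputs)) / n_branches
        for j in range(dim)
    ]


# Toy branches that just scale the input; real branches would be attention layers.
toy_branches = [lambda x, s=s: [s * v for v in x] for s in (1.0, 2.0, 3.0)]

print(multi_branch_average(toy_branches, [1.0, 2.0]))  # [2.0, 4.0]
```

Passing `training=True` with a non-zero `drop_rate` zeroes out a random subset of branches on each call, so the result varies between calls, while the inference path always returns the plain average.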

Yang Fan, Shufang Xie, Yingce Xia, Lijun Wu, Tao Qin, Xiang-Yang Li, Tie-Yan Liu • 2020

Related benchmarks

Task	Dataset	Result
Machine Translation	WMT En-De 2014 (test)	BLEU 30.8	379
Natural Language Understanding	GLUE (val)	SST-2 97	201
Machine Translation	IWSLT De-En 2014 (test)	BLEU 36.22	146
Machine Translation	IWSLT German-to-English '14 (test)	BLEU Score 36.2	110
Machine Translation	IWSLT En-De 2014 (test)	BLEU 29.9	92
Machine Translation	WMT En-De '14	BLEU 29.9	89
Machine Translation	WMT En-De 2019 (test)	SacreBLEU 40.4	37
Machine Translation	IWSLT De-En 14	BLEU Score 36.22	35
Code Generation	Java dataset (test)	BLEU 27.53	6
Code Generation	Python dataset (test)	BLEU 16.66	6

Showing 10 of 11 rows
