
Learning Deep Transformer Models for Machine Translation

About

Transformer is the state-of-the-art model in recent machine translation evaluations. Two strands of research are promising to improve models of this kind: the first uses wide networks (a.k.a. Transformer-Big) and has been the de facto standard for the development of the Transformer system, and the other uses deeper language representation but faces the difficulty arising from learning deep networks. Here, we continue the line of research on the latter. We claim that a truly deep Transformer model can surpass the Transformer-Big counterpart by 1) proper use of layer normalization and 2) a novel way of passing the combination of previous layers to the next. On WMT'16 English-German, NIST OpenMT'12 Chinese-English and larger WMT'18 Chinese-English tasks, our deep system (30/25-layer encoder) outperforms the shallow Transformer-Big/Base baseline (6-layer encoder) by 0.4-2.4 BLEU points. As another bonus, the deep model is 1.6X smaller in size and 3X faster in training than Transformer-Big.
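The two ingredients named in the abstract can be illustrated with a small NumPy sketch. It is not the authors' implementation. It assumes that "proper use of layer normalization" means normalizing the input of each sublayer (pre-norm) instead of its output (post-norm), and that "passing the combination of previous layers to the next" means a weighted sum over all earlier layer outputs. The function names, the parameter-free layer norm, and the `layer_weights` argument are hypothetical; in a real model the weights would be learned.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalize over the feature axis (no learned gain or bias in this sketch).
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def post_norm_block(x, sublayer):
    # Post-norm: normalize after the residual addition.
    return layer_norm(x + sublayer(x))

def pre_norm_block(x, sublayer):
    # Pre-norm (assumed reading): normalize the sublayer input only,
    # so the residual path carries x through unchanged.
    return x + sublayer(layer_norm(x))

def combine_previous_layers(outputs, weights):
    # Weighted sum of all earlier layer outputs; weights has one entry per output.
    stacked = np.stack(outputs, axis=0)            # (num_outputs, ..., d_model)
    return np.tensordot(np.asarray(weights), stacked, axes=1)

def deep_encoder(x, sublayers, layer_weights):
    # layer_weights[k] holds k + 1 coefficients: one for the embedding
    # and one for each of the k layers computed so far.
    outputs = [x]
    for k, sublayer in enumerate(sublayers):
        layer_input = combine_previous_layers(outputs, layer_weights[k])
        outputs.append(pre_norm_block(layer_input, sublayer))
    return layer_norm(outputs[-1])
```

If every `layer_weights[k]` puts weight 1 on the most recent output and 0 elsewhere, the sketch reduces to an ordinary stack of pre-norm residual blocks, which makes the combination step easy to compare against a plain deep encoder.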

Qiang Wang, Bei Li, Tong Xiao, Jingbo Zhu, Changliang Li, Derek F. Wong, Lidia S. Chao • 2019

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Machine Translation | WMT En-Fr 2014 (test) | BLEU 43.05 | 237 |
| Table Question Answering | WikiTableQuestions (test) | -- | 86 |
| Machine Translation | WMT EN-DE 2017 (test) | BLEU Score 0.282 | 46 |
| Machine Translation | WMT newstest 2015 (test) | BLEU 30.24 | 31 |
| Machine Translation | WMT newstest 2016 (test) | BLEU 34.26 | 31 |
| Machine Translation | WMT English-German (EN-DE) 2014 (newstest2014) | BLEU 29.3 | 29 |
| Table-based Question Answering | WIKITABLEQUESTIONS (dev) | -- | 25 |
| Machine Translation | WMT newstest 2010 (test) | BLEU 24.2 | 21 |
| Table Question Answering | WIKISQL WEAK (test) | Denotation Accuracy 79.3 | 20 |
| Zero-shot Downstream Reasoning and Knowledge Tasks | Downstream Reasoning Task Suite (ARC-E, ARC-C, HS, OBQA, PIQA, WG, Arith.) zero-shot | ARC-E 78.1 | 19 |
Showing 10 of 24 rows
