DeepNet: Scaling Transformers to 1,000 Layers
About
In this paper, we propose a simple yet effective method to stabilize extremely deep Transformers. Specifically, we introduce a new normalization function (DeepNorm) to modify the residual connection in Transformers, accompanied by a theoretically derived initialization. In-depth theoretical analysis shows that model updates can be bounded in a stable way. The proposed method combines the best of two worlds, i.e., the good performance of Post-LN and the stable training of Pre-LN, making DeepNorm a preferred alternative. We successfully scale Transformers up to 1,000 layers (i.e., 2,500 attention and feed-forward network sublayers) without difficulty, which is one order of magnitude deeper than previous deep Transformers. Remarkably, on a multilingual benchmark with 7,482 translation directions, our 200-layer model with 3.2B parameters significantly outperforms the 48-layer state-of-the-art model with 12B parameters by 5 BLEU points, which indicates a promising scaling direction.
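The DeepNorm residual connection described above can be sketched in a few lines. The snippet below is a minimal NumPy illustration, not the paper's implementation: it assumes the published formulation `x_{l+1} = LN(alpha * x_l + G_l(x_l))` and the encoder-only/decoder-only constants `alpha = (2N)^(1/4)`, `beta = (8N)^(-1/4)` (where `N` is the layer count and `beta` scales the initialization of selected sublayer weights); the helper names are illustrative.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Plain LayerNorm over the last dimension (learned gain/bias omitted
    # for brevity).
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def deepnorm_constants(num_layers):
    # Assumed DeepNorm constants for encoder-only / decoder-only models:
    # alpha scales the residual branch, beta scales the init of selected
    # sublayer weight matrices.
    alpha = (2 * num_layers) ** 0.25
    beta = (8 * num_layers) ** -0.25
    return alpha, beta

def deepnorm_residual(x, sublayer, alpha):
    # DeepNorm: x_{l+1} = LN(alpha * x_l + G_l(x_l)), where G_l is the
    # attention or feed-forward sublayer.
    return layer_norm(alpha * x + sublayer(x))
```

For a 1,000-layer model this gives `alpha ≈ 6.69`, i.e., the residual stream is up-weighted relative to each sublayer's output, which is what bounds the per-layer model update.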
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Machine Translation | WMT 2014 En-Fr (test) | BLEU | 43.93 | 237 |
| Machine Translation | WMT 2017 En-De (test) | BLEU Score | 0.297 | 46 |
| Machine Translation | WMT newstest2015 (test) | BLEU | 30.6 | 31 |
| Machine Translation | WMT newstest2016 (test) | BLEU | 34.39 | 31 |
| Machine Translation | WMT newstest2010 (test) | BLEU | 24.7 | 21 |
| Language Modeling | Pre-training corpus (train) | Perplexity | 22.77 | 20 |
| Machine Translation | OPUS-100 (test) | Average BLEU Score | 32.1 | 19 |
| Language Modeling | Language Modeling Corpus (val) | Average Perplexity | 11.47 | 19 |
| Zero-shot Downstream Reasoning and Knowledge Tasks | Downstream Reasoning Task Suite (ARC-E, ARC-C, HS, OBQA, PIQA, WG, Arith.), zero-shot | ARC-E | 68.4 | 19 |
| Machine Translation | WMT news Average 2010-2016 (test) | Average BLEU | 27.13 | 17 |