
DeepNet: Scaling Transformers to 1,000 Layers

About

In this paper, we propose a simple yet effective method to stabilize extremely deep Transformers. Specifically, we introduce a new normalization function (DeepNorm) to modify the residual connection in the Transformer, accompanied by a theoretically derived initialization. In-depth theoretical analysis shows that model updates can be bounded in a stable way. The proposed method combines the best of both worlds, i.e., the good performance of Post-LN and the stable training of Pre-LN, making DeepNorm a preferred alternative. We successfully scale Transformers up to 1,000 layers (i.e., 2,500 attention and feed-forward network sublayers) without difficulty, which is one order of magnitude deeper than previous deep Transformers. Remarkably, on a multilingual benchmark with 7,482 translation directions, our 200-layer model with 3.2B parameters significantly outperforms the 48-layer state-of-the-art model with 12B parameters by 5 BLEU points, which indicates a promising scaling direction.
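The core idea in the abstract can be sketched in a few lines: DeepNorm replaces the standard Post-LN residual x_{l+1} = LN(x_l + G(x_l)) with x_{l+1} = LN(α·x_l + G(x_l)), where the constant α (and a companion β used to scale the initialization of certain sublayer weights) depends on the depth of the stack. The sketch below uses the constants the paper gives for encoder-only / decoder-only stacks of N layers, α = (2N)^(1/4) and β = (8N)^(-1/4); it is an illustrative pure-Python toy over 1-D activation lists, not the authors' implementation, and `deepnorm_residual` / `alpha_beta` are names chosen here for clarity.

```python
import math

def alpha_beta(num_layers):
    # DeepNorm constants for an N-layer encoder-only or decoder-only stack:
    # alpha up-weights the skip connection, beta scales the initialization
    # gain of selected sublayer weights (the paper derives both from N).
    alpha = (2 * num_layers) ** 0.25
    beta = (8 * num_layers) ** -0.25
    return alpha, beta

def layer_norm(x, eps=1e-5):
    # Plain layer normalization over a 1-D list of activations.
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def deepnorm_residual(x, sublayer_out, alpha):
    # DeepNorm residual: x_{l+1} = LN(alpha * x_l + G(x_l)).
    # With alpha = 1 this reduces to ordinary Post-LN.
    return layer_norm([alpha * xi + gi for xi, gi in zip(x, sublayer_out)])
```

For a 1,000-layer stack, `alpha_beta(1000)` gives α ≈ 6.69 and β ≈ 0.11, so the skip path dominates each residual branch at initialization, which is how the bounded-update property is obtained.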

Hongyu Wang, Shuming Ma, Li Dong, Shaohan Huang, Dongdong Zhang, Furu Wei • 2022

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Machine Translation | WMT En-Fr 2014 (test) | BLEU | 43.93 | 237 |
| Machine Translation | WMT EN-DE 2017 (test) | BLEU Score | 0.297 | 46 |
| Machine Translation | WMT newstest 2015 (test) | BLEU | 30.6 | 31 |
| Machine Translation | WMT newstest 2016 (test) | BLEU | 34.39 | 31 |
| Machine Translation | WMT newstest 2010 (test) | BLEU | 24.7 | 21 |
| Language Modeling | Pre-training corpus (train) | Perplexity | 22.77 | 20 |
| Machine Translation | OPUS-100 (test) | Average BLEU Score | 32.1 | 19 |
| Language Modeling | Language Modeling Corpus (val) | Average Perplexity | 11.47 | 19 |
| Zero-shot Downstream Reasoning and Knowledge Tasks | Downstream Reasoning Task Suite (ARC-E, ARC-C, HS, OBQA, PIQA, WG, Arith.) zero-shot | ARC-E | 68.4 | 19 |
| Machine Translation | WMT news Average 2010-2016 (test) | Average BLEU | 27.13 | 17 |
(10 of 17 rows shown.)
