
Understanding the Difficulty of Training Transformers

About

Transformers have proved effective in many NLP tasks. However, their training requires non-trivial effort, such as carefully designing cutting-edge optimizers and learning rate schedulers (e.g., conventional SGD fails to train Transformers effectively). Our objective here is to understand $\textit{what complicates Transformer training}$ from both empirical and theoretical perspectives. Our analysis reveals that unbalanced gradients are not the root cause of the instability of training. Instead, we identify an amplification effect that influences training substantially -- for each layer in a multi-layer Transformer model, heavy dependency on its residual branch makes training unstable, since it amplifies small parameter perturbations (e.g., parameter updates) and results in significant disturbances in the model output. Yet we observe that a light dependency limits the model potential and leads to inferior trained models. Inspired by our analysis, we propose Admin ($\textbf{Ad}$aptive $\textbf{m}$odel $\textbf{in}$itialization) to stabilize the early stage of training and unleash the model's full potential in the late stage. Extensive experiments show that Admin is more stable, converges faster, and leads to better performance. Implementations are released at: https://github.com/LiyuanLucasLiu/Transforemr-Clinic.
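The amplification effect described in the abstract can be illustrated with a toy numerical experiment (a hedged sketch with linear stand-in sublayers, not the paper's actual model or method). We stack residual layers of the form x ← ω·x + f(x), where ω is a hypothetical shortcut weight controlling how heavily the output depends on the residual branch, and measure how much the final output shifts under the same small parameter perturbation:

```python
import numpy as np

def forward(x, Ws, omega):
    # Stack of toy residual sublayers: x <- omega * x + f_i(x),
    # with f_i(x) = x @ W_i standing in for an attention/FFN branch.
    for W in Ws:
        x = omega * x + x @ W
    return x

rng = np.random.default_rng(0)
d, n_layers, eps = 64, 24, 1e-3
x0 = rng.standard_normal(d)
Ws = [rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(n_layers)]
# Small perturbation of every layer's parameters (mimicking one update step).
dWs = [eps * rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(n_layers)]

def output_shift(omega):
    # Relative change in the model output caused by the perturbation.
    base = forward(x0, Ws, omega)
    pert = forward(x0, [W + dW for W, dW in zip(Ws, dWs)], omega)
    return np.linalg.norm(pert - base) / np.linalg.norm(base)

# A larger shortcut weight (lighter dependency on the residual branch)
# damps the output disturbance caused by the same perturbation.
print(output_shift(1.0), output_shift(2.0))
```

In this sketch, increasing ω reduces the relative output shift, mirroring the paper's observation that heavy dependency on the residual branch amplifies parameter perturbations; Admin's actual initialization rescales the shortcut per layer based on a profiling forward pass rather than using a single hand-set constant.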

Liyuan Liu, Xiaodong Liu, Jianfeng Gao, Weizhu Chen, Jiawei Han • 2020

Related benchmarks

Task                | Dataset                                          | Result           | Rank
--------------------|--------------------------------------------------|------------------|-----
Machine Translation | WMT En-De 2014 (test)                            | BLEU 29.11       | 379
Machine Translation | WMT En-Fr 2014 (test)                            | BLEU 43.8        | 237
Machine Translation | WMT English-German 2014 (test)                   | BLEU 27.9        | 136
Machine Translation | IWSLT En-De 2014 (test)                          | BLEU 36.1        | 92
Machine Translation | WMT EN-DE 2017 (test)                            | BLEU Score 0.288 | 46
Machine Translation | WMT En-Fr 2014                                   | BLEU 43.8        | 42
Machine Translation | WMT14 English-French (newstest2014)              | BLEU 43.8        | 39
Machine Translation | WMT newstest 2015 (test)                         | BLEU 30.35       | 31
Machine Translation | WMT newstest 2016 (test)                         | BLEU 34.12       | 31
Machine Translation | WMT English-German (EN-DE) 2014 (newstest2014)   | BLEU 30.01       | 29
Showing 10 of 17 rows

Other info

Code
