Understanding the Failure of Batch Normalization for Transformers in NLP

About

Batch Normalization (BN) is a core and prevalent technique for accelerating the training of deep neural networks and improving generalization on Computer Vision (CV) tasks. However, it fails to defend its position in Natural Language Processing (NLP), which is dominated by Layer Normalization (LN). In this paper, we try to answer why BN usually performs worse than LN in NLP tasks with Transformer models. We find that the inconsistency between training and inference of BN is the leading cause of the failure of BN in NLP. We define Training Inference Discrepancy (TID) to quantitatively measure this inconsistency and reveal that TID can indicate BN's performance, supported by extensive experiments, including image classification, neural machine translation, language modeling, sequence labeling, and text classification tasks. We find that BN can obtain much better test performance than LN when TID remains small throughout training. To suppress the explosion of TID, we propose Regularized BN (RBN), which adds a simple regularization term to narrow the gap between the batch statistics and population statistics of BN. RBN improves the performance of BN consistently and outperforms or is on par with LN on 17 out of 20 settings, involving ten datasets and two common variants of Transformer. Our code is available at https://github.com/wjxts/RegularizedBN.
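The abstract describes the idea only in words, so the following is a rough sketch of what "narrowing the gap between batch statistics and population statistics" could look like. It is not the authors' TID definition or their RBN regularizer, neither of which is given on this page. The function names (`batch_stats`, `statistic_gap`, `regularized_loss`), the squared-difference gap, and the `weight` parameter are all assumptions made for illustration; the repository linked above holds the actual method.

```python
from statistics import fmean


def batch_stats(batch):
    """Per-feature mean and biased variance of a batch of feature vectors."""
    columns = list(zip(*batch))
    means = [fmean(col) for col in columns]
    variances = [
        fmean([(x - m) ** 2 for x in col]) for col, m in zip(columns, means)
    ]
    return means, variances


def statistic_gap(batch_mean, batch_var, pop_mean, pop_var):
    """Hypothetical gap measure (an assumption, not the paper's TID):
    mean squared difference between batch and population statistics."""
    mean_gap = fmean([(b - p) ** 2 for b, p in zip(batch_mean, pop_mean)])
    var_gap = fmean([(b - p) ** 2 for b, p in zip(batch_var, pop_var)])
    return mean_gap + var_gap


def regularized_loss(task_loss, gap, weight=0.1):
    """Task loss plus a penalty on the statistic gap (weight is illustrative)."""
    return task_loss + weight * gap


# Two samples with two features each.
batch = [[1.0, 2.0], [3.0, 6.0]]
b_mean, b_var = batch_stats(batch)

# Population (running) statistics that BN would use at inference time.
pop_mean, pop_var = [0.0, 0.0], [1.0, 1.0]

gap = statistic_gap(b_mean, b_var, pop_mean, pop_var)
loss = regularized_loss(task_loss=1.0, gap=gap)
```

In this toy setup the penalty is zero when the batch statistics match the population statistics and grows as they drift apart, which is the qualitative behavior the abstract attributes to the regularization term.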

Jiaxi Wang, Ji Wu, Lei Huang • 2022

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Language Modeling | PTB | Perplexity 43.2 | 650 |
| Language Modeling | WikiText-103 | PPL 17.1 | 146 |
| Text Classification | DBpedia (DBP) | Accuracy 97.6 | 110 |
| Text Classification | IMDB | Accuracy 84.5 | 107 |
| Named Entity Recognition | CoNLL 03 | -- | 102 |
| Named Entity Recognition | RESUME | F1 Score 94.8 | 52 |
| Text Classification | Yelp | Accuracy 93.6 | 21 |
| Machine Translation | IWSLT 2014 | BLEU 35.6 | 20 |
| Neural Machine Translation | WMT16 | BLEU 26.5 | 14 |
| Text Classification | Sogou | Accuracy 94.7 | 6 |