R-Drop: Regularized Dropout for Neural Networks
About
Dropout is a powerful and widely used technique for regularizing the training of deep neural networks. In this paper, we introduce a simple regularization strategy built upon dropout, namely R-Drop, which forces the output distributions of different sub-models generated by dropout to be consistent with each other. Specifically, for each training sample, R-Drop minimizes the bidirectional KL-divergence between the output distributions of two sub-models sampled by dropout. Theoretical analysis reveals that R-Drop reduces the freedom of the model parameters and complements dropout. Experiments on 5 widely used deep learning tasks (18 datasets in total), including neural machine translation, abstractive summarization, language understanding, language modeling, and image classification, show that R-Drop is universally effective. In particular, it yields substantial improvements when applied to fine-tune large-scale pre-trained models, e.g., ViT, RoBERTa-large, and BART, and achieves state-of-the-art (SOTA) performance with the vanilla Transformer model on WMT14 English→German translation (30.91 BLEU) and WMT14 English→French translation (43.95 BLEU), even surpassing models trained with extra large-scale data and expert-designed advanced Transformer variants. Our code is available at https://github.com/dropreg/R-Drop.
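The per-sample objective described above combines the usual cross-entropy with a symmetric KL term between the two dropout passes. A minimal dependency-free sketch (names like `r_drop_loss`, `logits1`/`logits2`, and the weight `alpha` are illustrative, not the repository's API; in practice this would operate on batched framework tensors):

```python
import math

def softmax(logits):
    # numerically stable softmax over a list of logits
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl(p, q):
    # KL(p || q) for two discrete distributions
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def r_drop_loss(logits1, logits2, target, alpha=1.0):
    """R-Drop-style loss for one sample: two forward passes with
    independently sampled dropout yield logits1 and logits2."""
    p, q = softmax(logits1), softmax(logits2)
    # average cross-entropy of the two sub-model predictions
    ce = -0.5 * (math.log(p[target]) + math.log(q[target]))
    # bidirectional (symmetric) KL between the two output distributions
    kl_term = 0.5 * (kl(p, q) + kl(q, p))
    return ce + alpha * kl_term
```

When the two passes agree exactly, the KL term vanishes and the loss reduces to plain cross-entropy; the more the dropout-induced sub-models disagree, the larger the penalty, which is what pushes them toward consistent outputs.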
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Natural Language Understanding | GLUE (dev) | SST-2 (Acc) | 96.9 | 504 |
| Natural Language Understanding | GLUE (val) | -- | -- | 170 |
| Machine Translation | IWSLT De-En 2014 (test) | BLEU | 37.3 | 146 |
| Machine Translation | WMT 2014 (test) | BLEU | 34.54 | 100 |
| Machine Translation | IWSLT En-De 2014 (test) | BLEU | 37.25 | 92 |
| Machine Translation | WMT14 English-French (newstest2014) | BLEU | 43.95 | 39 |
| Machine Translation | IWSLT De-En 14 | BLEU | 37.25 | 33 |
| AI-generated text detection | Cross-genre (test) | Overall Accuracy | 95 | 32 |
| Semantic segmentation | DigestPath (test) | DSC | 70.37 | 29 |
| AIGT detection | HC3 PWWS attack, AI to Human (in-domain) | Overall Accuracy | 99.75 | 28 |