Better Fine-Tuning by Reducing Representational Collapse
About
Although widely adopted, existing approaches for fine-tuning pre-trained language models have been shown to be unstable across hyper-parameter settings, motivating recent work on trust region methods. In this paper, we present a simplified and efficient method rooted in trust region theory that replaces previously used adversarial objectives with parametric noise (sampled from either a normal or uniform distribution), thereby discouraging representation change during fine-tuning where possible without hurting performance. We also introduce a new analysis to motivate the use of trust region methods more generally, by studying representational collapse: the degradation of generalizable representations from pre-trained models as they are fine-tuned for a specific end task. Extensive experiments show that our fine-tuning method matches or exceeds the performance of previous trust region methods on a range of understanding and generation tasks (including CNN/DailyMail, Gigaword, Reddit TIFU, and the GLUE benchmark), while also being much faster. We also show that it is less prone to representational collapse: the pre-trained models retain more generalizable representations each time they are fine-tuned.
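The core idea in the abstract, replacing adversarial perturbations with sampled parametric noise, can be sketched as a consistency regularizer: penalize the divergence between the model's predictions on clean embeddings and on embeddings perturbed with small random noise. The following is a minimal NumPy illustration, not the authors' implementation; `noise_regularizer`, `embed`, and `head` are hypothetical names, and the symmetric KL penalty shown here is one common instantiation of this style of objective.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax, numerically stabilized."""
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def symmetric_kl(p, q, eps=1e-12):
    """Symmetric KL divergence KL(p||q) + KL(q||p) between distributions."""
    return float(np.sum(p * np.log((p + eps) / (q + eps)))
                 + np.sum(q * np.log((q + eps) / (p + eps))))

def noise_regularizer(embed, head, sigma=1e-5, dist="normal", rng=None):
    """Consistency penalty: divergence between predictions on clean
    embeddings and embeddings perturbed with parametric noise
    (normal or uniform, as in the abstract)."""
    rng = np.random.default_rng(rng)
    if dist == "normal":
        noise = rng.normal(0.0, sigma, size=embed.shape)
    else:
        noise = rng.uniform(-sigma, sigma, size=embed.shape)
    p = softmax(head(embed))          # predictions on clean input
    q = softmax(head(embed + noise))  # predictions on noised input
    return symmetric_kl(p, q)

# Toy example: a random linear classification head over 8-dim embeddings.
rng = np.random.default_rng(0)
W = rng.normal(size=(8, 3))
head = lambda e: e @ W
embed = rng.normal(size=(4, 8))  # batch of 4 embeddings
reg = noise_regularizer(embed, head, sigma=1e-5, rng=1)
```

In a fine-tuning loop, `reg` would be added (scaled by a hyper-parameter) to the task loss, so that small input perturbations cannot swing the model's predictions; with a tiny `sigma`, the penalty stays near zero unless the representation becomes locally unstable.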
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Natural Language Understanding | GLUE (dev) | SST-2 (Acc): 97.1 | 504 |
| Natural Language Understanding | GLUE (val) | -- | 170 |
| Summarization | Gigaword (test) | ROUGE-2: 20.7 | 38 |
| Natural Language Inference | XNLI 1.0 (test) | Accuracy: 81.4 | 38 |
| Abstractive Summarization | CNN/DailyMail | ROUGE-1: 44.38 | 25 |
| Binary Classification | AdvGLUE (test) | QNLI Accuracy: 0.475 | 17 |
| Summarization | CNN-DM (test) | ROUGE-1: 37.28 | 11 |
| Summarization | Reddit TIFU Long (test) | ROUGE-1: 30.31 | 4 |
| Summarization | Reddit TIFU Long | ROUGE-1: 30.31 | 3 |