Exploring the Limits of Domain-Adaptive Training for Detoxifying Large-Scale Language Models
About
Pre-trained language models (LMs) are known to easily generate toxic language. In this work, we systematically explore domain-adaptive training as a way to reduce the toxicity of language models. We study three dimensions: the training corpus, model size, and parameter efficiency.

For the training corpus, we propose leveraging the generative power of LMs to produce nontoxic datasets for domain-adaptive training, which mitigates exposure bias and proves more data-efficient than using a curated pre-training corpus. We demonstrate that this self-generation method consistently outperforms existing baselines across model sizes on both automatic and human evaluations, even when it uses a training corpus that is one-third smaller.

We then comprehensively study detoxification of LMs with parameter counts ranging from 126M up to 530B (3x larger than GPT-3), a scale that has never been studied before. We find that i) large LMs have toxicity levels similar to smaller ones given the same pre-training corpus, and ii) large LMs require more effort to detoxify. Finally, we explore parameter-efficient training methods for detoxification and show that adding and training adapter-only layers not only saves a substantial number of parameters but also achieves a better toxicity-perplexity trade-off than whole-model adaptation for large-scale models.
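The adapter-only approach mentioned above can be illustrated with a minimal bottleneck-adapter sketch in PyTorch. This is not the paper's code; the class name, bottleneck size, and initialization are illustrative assumptions. The key idea is that the base LM's weights stay frozen and only the small adapter layers are trained:

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter: down-project, nonlinearity, up-project,
    plus a residual connection (illustrative sketch, not the paper's code)."""
    def __init__(self, hidden_size: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(hidden_size, bottleneck)
        self.up = nn.Linear(bottleneck, hidden_size)
        # Zero-init the up-projection so the adapter starts as an identity
        # map and training begins from the frozen LM's original behavior.
        nn.init.zeros_(self.up.weight)
        nn.init.zeros_(self.up.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.up(torch.relu(self.down(x)))

def count_trainable(model: nn.Module) -> int:
    """Number of parameters that will actually be updated."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)
```

In a parameter-efficient setup, one would freeze the LM (`p.requires_grad = False` on its parameters) and insert such adapters after transformer sublayers, so only a small fraction of the total parameters is trained during detoxification.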
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Language Modeling | (val) | Perplexity 7.86 | 30 |
| Toxicity Evaluation | RealToxicityPrompts | -- | 29 |
| Detoxification | RealToxicityPrompts | Avg. Max. Toxicity 0.27 | 22 |
| Utility Evaluation | Downstream Tasks | Average Accuracy 62.6 | 12 |
| Toxicity Analysis | RealToxicityPrompts Nontoxic | Exp. Max. Toxicity 0.22 | 10 |
| Zero-shot Task Evaluation | 9 Downstream Tasks Utility | Average Accuracy 54.7 | 10 |
| Language Modeling | LM (val) | Validation PPL 11.14 | 9 |
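The "Avg. Max. Toxicity" and "Exp. Max. Toxicity" entries above refer to the standard RealToxicityPrompts metric: for each prompt, sample several continuations, score each (typically with the Perspective API), take the per-prompt maximum, and average those maxima over all prompts. A minimal sketch, with made-up scores (the function name and data are illustrative, not from the benchmark code):

```python
def expected_max_toxicity(scores_per_prompt):
    """Expected Maximum Toxicity: average, over prompts, of the highest
    toxicity score among that prompt's sampled continuations.
    Scores are assumed to lie in [0, 1]."""
    maxima = [max(scores) for scores in scores_per_prompt]
    return sum(maxima) / len(maxima)

# Toy example: 3 prompts, 4 continuations each (scores are made up).
scores = [
    [0.05, 0.40, 0.10, 0.02],
    [0.01, 0.03, 0.02, 0.04],
    [0.60, 0.10, 0.20, 0.15],
]
print(expected_max_toxicity(scores))  # (0.40 + 0.04 + 0.60) / 3
```

A lower value means the model is less likely to produce a highly toxic continuation even in its worst sample for a prompt.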