UL2: Unifying Language Learning Paradigms
About
Existing pre-trained models are generally geared towards a particular class of problems, and to date there is still no consensus on what the right architecture and pre-training setup should be. This paper presents a unified framework for pre-training models that are universally effective across datasets and setups. We begin by disentangling architectural archetypes from pre-training objectives -- two concepts that are commonly conflated. Next, we present a generalized and unified perspective for self-supervision in NLP, showing how different pre-training objectives can be cast as one another and how interpolating between objectives can be effective. We then propose Mixture-of-Denoisers (MoD), a pre-training objective that combines diverse pre-training paradigms, and introduce the notion of mode switching, wherein downstream fine-tuning is associated with specific pre-training schemes.

Extensive ablative experiments comparing multiple pre-training objectives show that our method pushes the Pareto frontier, outperforming T5- and GPT-like models across multiple diverse setups. By scaling our model up to 20B parameters, we achieve SOTA performance on 50 well-established supervised fine-tuning NLP tasks. Our model also achieves strong results at in-context learning, outperforming 175B GPT-3 on zero-shot SuperGLUE and tripling the performance of T5-XXL on one-shot summarization. On 0-shot MMLU, UL2 20B outperforms T0 and T5 models. UL2 20B also works well with chain-of-thought prompting and reasoning, making it an appealing choice for research into reasoning at a small-to-medium scale of 20B parameters.

Finally, we apply FLAN instruction tuning to the UL2 20B model, achieving MMLU and Big-Bench scores competitive with FLAN-PaLM 62B. We release Flax-based T5X checkpoints for UL2 20B and Flan-UL2 20B.
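To make the MoD and mode-switching ideas concrete, here is a toy Python sketch. The real implementation lives in the T5X/SeqIO codebase; the span lengths, corruption rates, and the `corrupt` helper below are illustrative assumptions, not the paper's exact hyperparameters. The key ideas it shows are that each training example is corrupted by one randomly chosen denoiser (regular, extreme, or sequential), and that a paradigm token prepended to the input tells the model which mode it is in:

```python
import random

# Toy sketch of Mixture-of-Denoisers (MoD) with mode switching.
# Hyperparameters here are illustrative, not the paper's exact settings.
DENOISERS = {
    "[R]": {"mean_span": 3, "corrupt_rate": 0.15},    # regular T5-style span corruption
    "[X]": {"mean_span": 12, "corrupt_rate": 0.5},    # extreme denoising: long spans / heavy corruption
    "[S]": {"mean_span": None, "corrupt_rate": 0.25}, # sequential (prefix-LM style) denoising
}

def corrupt(tokens, mode, rng):
    """Corrupt `tokens` with the denoiser named by `mode`.

    Returns (inputs, targets); `inputs` starts with the paradigm token,
    which is also what mode switching prepends at fine-tuning time.
    """
    cfg = DENOISERS[mode]
    if mode == "[S]":
        # S-denoising: split at a pivot and predict the suffix from the prefix.
        pivot = max(1, int(len(tokens) * (1 - cfg["corrupt_rate"])))
        return [mode] + tokens[:pivot], tokens[pivot:]
    # R-/X-denoising: mask contiguous spans, replacing each with a sentinel.
    inputs, targets, i, sid = [mode], [], 0, 0
    while i < len(tokens):
        if rng.random() < cfg["corrupt_rate"] / cfg["mean_span"]:
            span = min(cfg["mean_span"], len(tokens) - i)
            sentinel = f"<extra_id_{sid}>"
            inputs.append(sentinel)
            targets += [sentinel] + tokens[i:i + span]
            i += span
            sid += 1
        else:
            inputs.append(tokens[i])
            i += 1
    return inputs, targets

# During pre-training, each example draws one denoiser at random.
rng = random.Random(0)
toks = "the quick brown fox jumps over the lazy dog".split()
mode = rng.choice(sorted(DENOISERS))
inputs, targets = corrupt(toks, mode, rng)
```

Because the paradigm token is an ordinary input token, a downstream task can be fine-tuned (or prompted) with whichever mode token best matches it, which is the mode-switching behavior described above.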
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Mathematical Reasoning | GSM8K | Accuracy | 10.2 | 983 |
| Multi-task Language Understanding | MMLU | Accuracy | 74.1 | 842 |
| Reasoning | BBH | Accuracy | 46 | 507 |
| Commonsense Reasoning | CSQA | Accuracy | 55.7 | 366 |
| Summarization | XSum (test) | ROUGE-2 | 26.6 | 231 |
| Grammatical Error Correction | CoNLL 2014 (test) | F0.5 Score | 67.5 | 207 |
| Reasoning | ARC Easy | Accuracy | 69.8 | 183 |
| Language Understanding | MMLU (test) | MMLU Average Accuracy | 58.1 | 136 |
| Commonsense Reasoning | ARC Challenge | Accuracy | 49.5 | 132 |
| Commonsense Reasoning | StrategyQA | Accuracy | 59 | 125 |