The Pile: An 800GB Dataset of Diverse Text for Language Modeling
About
Recent work has demonstrated that increased training dataset diversity improves general cross-domain knowledge and downstream generalization capability for large-scale language models. With this in mind, we present the Pile: an 825 GiB English text corpus targeted at training large-scale language models. The Pile is constructed from 22 diverse high-quality subsets, both existing and newly constructed, many of which derive from academic or professional sources. Our evaluation of the untuned performance of GPT-2 and GPT-3 on the Pile shows that these models struggle on many of its components, such as academic writing. Conversely, models trained on the Pile improve significantly over both Raw CC and CC-100 on all components of the Pile, while improving performance on downstream evaluations. Through an in-depth exploratory analysis, we document potentially concerning aspects of the data for prospective users. We make publicly available the code used in its construction.
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Physical Commonsense Reasoning | PIQA | Accuracy | 72.9 | 696 |
| Natural Language Inference | RTE | Accuracy | 53.4 | 590 |
| Commonsense Reasoning | WinoGrande | Accuracy | 57.8 | 453 |
| Multi-task Language Understanding | MMLU | MMLU Accuracy | 25 | 442 |
| Sentence Completion | HellaSwag | Accuracy | 55.2 | 364 |
| Boolean Question Answering | BoolQ | Accuracy | 61.7 | 350 |
| Multiple-choice Question Answering | ARC Challenge (test) | Accuracy | 29.8 | 57 |
| OpenBook Question Answering | OBQA | Accuracy | 34.6 | 32 |
| Zero-shot Evaluation Aggregate | ARC-C, BoolQ, HellaSwag, MMLU, OBQA, PIQA, RTE, WinoGrande Aggregate | Average Accuracy | 48.8 | 13 |
| Language Understanding | General Understanding Tasks (ARC-E, BoolQ, Wino., PIQA, HellaSwag, TruthfulQA, OBQA, LogiQA) | ARC-E Accuracy | 60.5 | 8 |
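
The aggregate row can be cross-checked against the per-task rows. The sketch below assumes that the "Average Accuracy" figure is the unweighted mean of the eight per-task accuracies listed in the table (ARC-C, BoolQ, HellaSwag, MMLU, OBQA, PIQA, RTE, WinoGrande), and that all eight rows refer to the same model; the page does not state either point explicitly. The names `accuracies` and `mean_accuracy` are illustrative only.

```python
# Per-task accuracies (percent), copied from the table above.
# Assumption: these eight rows are the components of the aggregate row.
accuracies = {
    "ARC-C": 29.8,
    "BoolQ": 61.7,
    "HellaSwag": 55.2,
    "MMLU": 25.0,
    "OBQA": 34.6,
    "PIQA": 72.9,
    "RTE": 53.4,
    "WinoGrande": 57.8,
}


def mean_accuracy(scores: dict[str, float]) -> float:
    """Unweighted mean over tasks (each task counts equally)."""
    return sum(scores.values()) / len(scores)


average = mean_accuracy(accuracies)
print(f"Average accuracy over {len(accuracies)} tasks: {average:.1f}")
```

Under that assumption the computed mean rounds to the 48.8 reported in the aggregate row. The average weights every task equally, regardless of how many examples each benchmark contains.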