The BigScience ROOTS Corpus: A 1.6TB Composite Multilingual Dataset
About
As language models grow ever larger, the need for large-scale high-quality text datasets has never been more pressing, especially in multilingual settings. The BigScience workshop, a 1-year international and multidisciplinary initiative, was formed with the goal of researching and training large language models as a values-driven undertaking, putting issues of ethics, harm, and governance in the foreground. This paper documents the data creation and curation efforts undertaken by BigScience to assemble the Responsible Open-science Open-collaboration Text Sources (ROOTS) corpus, a 1.6TB dataset spanning 59 languages that was used to train the 176-billion-parameter BigScience Large Open-science Open-access Multilingual (BLOOM) language model. We further release a large initial subset of the corpus and analyses thereof, and hope to empower large-scale monolingual and multilingual modeling projects with both the data and the processing tools, as well as stimulate research around this large multilingual corpus.
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Mathematical Reasoning | MATH500 (test) | -- | 895 |
| Mathematical Reasoning | AIME 2024 (test) | -- | 209 |
| Math Reasoning | AMC 2023 (test) | Pass@128 | 57 |
| Mathematical Reasoning | GSM8K (test) | Pass@1 Accuracy: 75.8 | 34 |
| Code Reasoning | CRUX official (test) | Pass@1 Accuracy: 42.5 | 20 |