The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale
About
The performance of a large language model (LLM) depends heavily on the quality and size of its pretraining dataset. However, the pretraining datasets for state-of-the-art open LLMs like Llama 3 and Mixtral are not publicly available and very little is known about how they were created. In this work, we introduce FineWeb, a 15-trillion token dataset derived from 96 Common Crawl snapshots that produces better-performing LLMs than other open pretraining datasets. To advance the understanding of how best to curate high-quality pretraining datasets, we carefully document and ablate all of the design choices used in FineWeb, including in-depth investigations of deduplication and filtering strategies. In addition, we introduce FineWeb-Edu, a 1.3-trillion token collection of educational text filtered from FineWeb. LLMs pretrained on FineWeb-Edu exhibit dramatically better performance on knowledge- and reasoning-intensive benchmarks like MMLU and ARC. Along with our datasets, we publicly release our data curation codebase and all of the models trained during our ablation experiments.
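The deduplication step mentioned above is commonly done with MinHash-based fuzzy matching over document shingles. The following is a minimal, self-contained sketch of that general technique; the shingle size (word 5-grams), number of hash functions, and similarity threshold are illustrative assumptions, not the exact parameters used for FineWeb.

```python
import hashlib
import re


def minhash_signature(text, num_hashes=64):
    """Compute a MinHash signature over word 5-gram shingles of a document.

    Illustrative sketch of MinHash fuzzy deduplication; shingle size and
    hash count are assumed values, not FineWeb's exact configuration.
    """
    words = re.findall(r"\w+", text.lower())
    # Build the set of word 5-grams (at least one shingle for short texts).
    shingles = {" ".join(words[i:i + 5]) for i in range(max(1, len(words) - 4))}
    signature = []
    for seed in range(num_hashes):
        # Simulate one hash function per seed; keep the minimum hash
        # value over all shingles for that function.
        signature.append(min(
            int.from_bytes(hashlib.sha1(f"{seed}:{s}".encode()).digest()[:8], "big")
            for s in shingles))
    return signature


def jaccard_estimate(sig_a, sig_b):
    """Estimate Jaccard similarity as the fraction of matching signature slots."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```

In a dedup pass, a document whose estimated similarity to an already-kept document exceeds some threshold (e.g. 0.8) would be dropped; near-duplicates agree on most signature slots while unrelated documents agree on almost none.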
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Reasoning | BBH | Accuracy | 9.73 | 507 |
| Commonsense Reasoning | StoryCloze | Accuracy | 66.76 | 34 |
| Reading Comprehension | RACE-m | Accuracy | 0.2681 | 28 |
| Zero-shot Language Understanding and Reasoning | BENCH-PROXY (MMLU, ANLI, HellaSwag, PIQA, SIQA, W.G., ARC-E, ARC-C, C.QA, WSC) (test) | MMLU | 32.94 | 24 |
| General Language Understanding | 12-task evaluation suite (test) | Average Score | 59.1 | 20 |
| Large Language Model Evaluation | 12-task evaluation suite composite (test) | Reading Comprehension Score | 49.6 | 14 |
| Reading Comprehension | RACE | -- | -- | 12 |
| Natural Language Inference | AX-b | Accuracy | 55.25 | 9 |
| Natural Language Inference | AX-g | Accuracy | 50 | 9 |
| Zero-shot Language Modeling Evaluation | SmolLM Tasks (ARC, CommonsenseQA, HellaSwag, MMLU, OpenBookQA, PIQA, WinoGrande, TriviaQA) (test) | Average Rank | 2.4444 | 4 |