The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only

About

Large language models are commonly trained on a mixture of filtered web data and curated high-quality corpora, such as social media conversations, books, or technical papers. This curation process is believed to be necessary to produce performant models with broad zero-shot generalization abilities. However, as larger models requiring pretraining on trillions of tokens are considered, it is unclear how scalable curation is and whether we will soon run out of unique high-quality data. Contrary to previous beliefs, we show that properly filtered and deduplicated web data alone can lead to powerful models, even significantly outperforming state-of-the-art models trained on The Pile. Despite extensive filtering, the high-quality data we extract from the web is still plentiful, and we are able to obtain five trillion tokens from CommonCrawl. We publicly release an extract of 600 billion tokens from our RefinedWeb dataset, and 1.3B- and 7.5B-parameter language models trained on it.

Guilherme Penedo, Quentin Malartic, Daniel Hesslow, Ruxandra Cojocaru, Alessandro Cappelli, Hamza Alobeidli, Baptiste Pannier, Ebtesam Almazrouei, Julien Launay • 2023

Related benchmarks

Task	Dataset	Metric	Result	
Commonsense Reasoning	WinoGrande	Accuracy	66.2	1442
Code Generation	HumanEval	Pass@1	0.61	1043
Multi-task Language Understanding	MMLU	Accuracy	70.4	881
Mathematical Reasoning	GSM8K (test)	Accuracy	19.6	816
Reasoning	BBH	Accuracy	54	726
Physical Commonsense Reasoning	PIQA	Accuracy	79.4	696
Code Generation	HumanEval (test)	Pass@1	0.00e+0	612
Multi-turn Dialogue Evaluation	MT-Bench	Overall Score	5.17	532
Question Answering	OpenBookQA	Accuracy	32	465
Mathematical Reasoning	MATH (test)	Overall Accuracy	2.5	433

Showing 10 of 36 rows
