OpenWebMath: An Open Dataset of High-Quality Mathematical Web Text

About

There is growing evidence that pretraining on high-quality, carefully thought-out tokens such as code or mathematics plays an important role in improving the reasoning abilities of large language models. For example, Minerva, a PaLM model finetuned on billions of tokens of mathematical documents from arXiv and the web, reported dramatically improved performance on problems that require quantitative reasoning. However, because all known open-source web datasets employ preprocessing that does not faithfully preserve mathematical notation, the benefits of large-scale training on quantitative web documents are unavailable to the research community. We introduce OpenWebMath, an open dataset inspired by these works containing 14.7B tokens of mathematical webpages from Common Crawl. We describe in detail our method for extracting text and LaTeX content and removing boilerplate from HTML documents, as well as our methods for quality filtering and deduplication. Additionally, we run small-scale experiments by training 1.4B parameter language models on OpenWebMath, showing that models trained on 14.7B tokens of our dataset surpass the performance of models trained on over 20x the amount of general language data. We hope that our dataset, openly released on the Hugging Face Hub, will help spur advances in the reasoning abilities of large language models.

Keiran Paster, Marco Dos Santos, Zhangir Azerbayev, Jimmy Ba • 2023

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Mathematical Reasoning | GSM8K | Accuracy 44.12 | 983 |
| Mathematical Reasoning | MATH | Accuracy 14.26 | 643 |
| Reasoning | BBH | Accuracy 56.5 | 507 |
| Commonsense Reasoning | PIQA 1.0 (test) | Accuracy 82.21 | 48 |
| Commonsense Reasoning | HellaSwag 1.0 (test) | Accuracy 62.21 | 17 |
| Commonsense Reasoning | WinoGrande 1.0 (test) | Accuracy 0.8019 | 15 |
| World Knowledge and Reading Comprehension | LM Evaluation Harness NQ, MMLU STEM, ARC, SciQ, LogiQA, BoolQ | NQ Accuracy 29.17 | 15 |
| Math problem solving | GSM8k, SAT-Math, & MATH OpenCompass AGIEval sampled (test) | GSM8k Accuracy 28.51 | 4 |
