OpenMathInstruct-2: Accelerating AI for Math with Massive Open-Source Instruction Data

About

Mathematical reasoning remains a critical challenge in large language model (LLM) development and attracts significant interest. However, most cutting-edge progress in mathematical reasoning with LLMs has become closed-source due to lack of access to training data. This lack of data access prevents researchers from understanding the impact of different choices for synthesizing and utilizing the data. With the goal of creating a high-quality supervised finetuning (SFT) dataset for math reasoning, we conduct careful ablation experiments on data synthesis using the recently released Llama3.1 family of models. Our experiments show that: (a) solution format matters, with excessively verbose solutions proving detrimental to SFT performance, (b) data generated by a strong teacher outperforms equally-sized data generated by a weak student model, (c) SFT is robust to low-quality solutions, allowing for imprecise data filtering, and (d) question diversity is crucial for achieving data scaling gains. Based on these insights, we create the OpenMathInstruct-2 dataset, which consists of 14M question-solution pairs (≈ 600K unique questions), making it nearly eight times larger than the previous largest open-source math reasoning dataset. Finetuning Llama-3.1-8B-Base on OpenMathInstruct-2 outperforms Llama3.1-8B-Instruct on MATH by an absolute 15.9% (51.9% → 67.8%). Finally, to accelerate open-source efforts, we release the code, the finetuned models, and the OpenMathInstruct-2 dataset under a commercially permissive license.
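The headline numbers in the abstract can be sanity-checked with a few lines of arithmetic. The sketch below uses only the figures quoted above; the constant and function names are illustrative assumptions, and the unique-question count is the approximate value (≈ 600K) given in the abstract.

```python
# Figures quoted in the abstract; all names here are illustrative, not from the paper's code.
BASELINE_MATH_ACC = 51.9    # Llama3.1-8B-Instruct accuracy on MATH (%)
FINETUNED_MATH_ACC = 67.8   # Llama-3.1-8B-Base finetuned on OpenMathInstruct-2 (%)
TOTAL_PAIRS = 14_000_000    # question-solution pairs
UNIQUE_QUESTIONS = 600_000  # approximate number of unique questions


def absolute_gain(before: float, after: float) -> float:
    """Difference in accuracy, in percentage points."""
    return round(after - before, 1)


def relative_gain(before: float, after: float) -> float:
    """Improvement relative to the baseline accuracy, in percent."""
    return round(100 * (after - before) / before, 1)


def solutions_per_question(pairs: int, questions: int) -> float:
    """Average number of solutions per unique question."""
    return round(pairs / questions, 1)


if __name__ == "__main__":
    print("absolute gain (points):", absolute_gain(BASELINE_MATH_ACC, FINETUNED_MATH_ACC))
    print("relative gain (%):", relative_gain(BASELINE_MATH_ACC, FINETUNED_MATH_ACC))
    print("solutions per question:", solutions_per_question(TOTAL_PAIRS, UNIQUE_QUESTIONS))
```

Under these figures, the dataset averages roughly 23 solutions per unique question, which is consistent with the abstract's distinction between the total number of pairs and question diversity.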

Shubham Toshniwal, Wei Du, Ivan Moshkov, Branislav Kisacanin, Alexan Ayrapetyan, Igor Gitman • 2024

Related benchmarks

Task	Dataset	Result
Mathematical Reasoning	MATH500 (test)	Accuracy 79.6	895
Mathematical Reasoning	GSM8K (test)	Accuracy 92	816
Mathematical Reasoning	AIME 2024 (test)	Accuracy 13.3	209
Mathematical Reasoning	HMMT 2025	Accuracy 0	194
Mathematical Reasoning	Omni-MATH	Accuracy 22.5	123
Mathematical Reasoning	GSM8K v1 (test)	Accuracy 84.9	118
Mathematical Reasoning	OlympiadBench Math	Accuracy 30.7	84
Mathematical Reasoning	AIME 2025	Accuracy 5	59
Mathematical Reasoning	Mathematical Reasoning Suite (MATH 500, AIME 2024, AIME 2025, AMC 2023, Olympiad Bench)	Average Score 28.8	29
Out-of-Domain Reasoning	Out-of-Domain Reasoning Suite (BGQA, CRUX Eval, Strategy QA, Table Bench)	BGQA Accuracy 68.7	9

Showing 10 of 11 rows
