DeepMath-103K: A Large-Scale, Challenging, Decontaminated, and Verifiable Mathematical Dataset for Advancing Reasoning
About
Reinforcement learning (RL) with large language models shows promise for complex reasoning. However, progress is hindered by the lack of large-scale training data that is sufficiently challenging, contamination-free, and verifiable. To this end, we introduce DeepMath-103K, a large-scale mathematical dataset designed with high difficulty (primarily levels 5-9), rigorous decontamination against numerous benchmarks, and verifiable answers for rule-based RL rewards. Each problem also includes three distinct R1 solutions, which can be adapted to other training paradigms such as supervised fine-tuning (SFT). Spanning a wide range of mathematical topics, DeepMath-103K fosters the development of generalizable, advanced reasoning. Notably, models trained on DeepMath-103K achieve state-of-the-art results on challenging mathematical benchmarks and generalize beyond math to fields such as biology, physics, and chemistry, underscoring the dataset's broad efficacy. Data: https://huggingface.co/datasets/zwhe99/DeepMath-103K.
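To make "verifiable answers for rule-based RL reward" concrete, here is a minimal sketch of a binary reward function. It is not the authors' verifier: it assumes the model writes its final answer inside `\boxed{...}` and that a whitespace-insensitive string match against the reference is good enough. The function names (`extract_boxed`, `normalize`, `rule_based_reward`) are hypothetical, and a real verifier would likely also check mathematical equivalence (for example, `0.5` versus `\frac{1}{2}`).

```python
import re
from typing import Optional


def extract_boxed(text: str) -> Optional[str]:
    """Return the contents of the last \\boxed{...} in text, or None."""
    start = text.rfind("\\boxed{")
    if start == -1:
        return None
    i = start + len("\\boxed{")
    depth = 1  # we are inside the opening brace of \boxed{
    out = []
    while i < len(text):
        ch = text[i]
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(out)
        out.append(ch)
        i += 1
    return None  # braces never closed


def normalize(answer: str) -> str:
    """Assumed normalization: drop surrounding '$' and all whitespace."""
    answer = answer.strip().strip("$")
    return re.sub(r"\s+", "", answer)


def rule_based_reward(response: str, reference: str) -> float:
    """Return 1.0 if the boxed answer matches the reference, else 0.0."""
    predicted = extract_boxed(response)
    if predicted is None:
        return 0.0  # no parseable final answer
    return 1.0 if normalize(predicted) == normalize(reference) else 0.0
```

For instance, `rule_based_reward("The answer is \\boxed{ 42 }.", "42")` returns `1.0`, while a response with no boxed answer returns `0.0`. Because the reward is computed by rule rather than by a learned model, it only works when every problem has a single checkable final answer, which is the property the dataset is designed to provide.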
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Mathematical Reasoning | Omni-MATH | Accuracy | 45.4 | 93 |
| Mathematical Reasoning | Minerva Math | pass@1 Accuracy | 45.4 | 91 |
| Mathematical Reasoning | OlympiadBench Math | Accuracy | 60.2 | 84 |
| Mathematical Reasoning | HMMT 2025 | Accuracy | 11.7 | 70 |
| Mathematical Reasoning | AMC 2023 | Pass@1 | 64.7 | 67 |
| Mathematical Reasoning | AIME 2025 | Accuracy | 31.7 | 59 |
| Mathematical Reasoning | AIME 2024 | Pass@1 | 19.4 | 29 |
| Scientific Reasoning | MMLU STEM | Accuracy | 72.7 | 27 |
| Mathematical Problem Solving | IneqMath (IM) | Exact Match Accuracy | 76 | 12 |
| Mathematical Problem Solving | Putnam-Axiom (PA) | Exact Match Accuracy | 39.1 | 12 |