DeepMath-103K: A Large-Scale, Challenging, Decontaminated, and Verifiable Mathematical Dataset for Advancing Reasoning
About
Reinforcement learning (RL) with large language models shows promise for complex reasoning. However, its progress is hindered by the lack of large-scale training data that is sufficiently challenging, contamination-free, and verifiable. To this end, we introduce DeepMath-103K, a large-scale mathematical dataset that features high difficulty (primarily levels 5-9), rigorous decontamination against numerous benchmarks, and verifiable answers for rule-based RL rewards. It further includes three distinct R1 solutions that can be adapted to diverse training paradigms such as supervised fine-tuning (SFT). Spanning a wide range of mathematical topics, DeepMath-103K fosters the development of generalizable and advanced reasoning. Notably, models trained on DeepMath-103K achieve state-of-the-art results on challenging mathematical benchmarks and generalize beyond math to fields such as biology, physics, and chemistry, underscoring the dataset's broad efficacy. Data: https://huggingface.co/datasets/zwhe99/DeepMath-103K.
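To make "verifiable answers for rule-based RL rewards" concrete, the following is a minimal sketch of what such a reward could look like. It is not the authors' verifier: the function names (`extract_boxed`, `normalize`, `rule_based_reward`), the convention that the model writes its final answer inside `\boxed{...}`, and the plain string comparison are all assumptions made for illustration. A practical verifier would likely also need symbolic equivalence checks (for example, treating `0.5` and `\frac{1}{2}` as equal).

```python
from typing import Optional


def extract_boxed(text: str) -> Optional[str]:
    """Return the contents of the last \\boxed{...} in text, or None.

    Assumption: the model marks its final answer with \\boxed{...}.
    Braces are matched manually so nested LaTeX such as \\frac{1}{2} survives.
    """
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    depth = 1
    chars = []
    for ch in text[start + len(marker):]:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(chars)
        chars.append(ch)
    return None  # unbalanced braces: treat as no answer


def normalize(answer: str) -> str:
    """Light, hypothetical normalization: trim, drop $ delimiters and spaces."""
    return answer.strip().strip("$").replace(" ", "").rstrip(".")


def rule_based_reward(response: str, reference: str) -> float:
    """Binary reward: 1.0 if the boxed answer matches the reference, else 0.0."""
    predicted = extract_boxed(response)
    if predicted is None:
        return 0.0
    return 1.0 if normalize(predicted) == normalize(reference) else 0.0


if __name__ == "__main__":
    response = "Integrating gives the value \\boxed{\\frac{1}{2}}."
    print(rule_based_reward(response, "\\frac{1}{2}"))
```

Because the reward depends only on the final answer and a fixed rule, it needs no learned reward model; this is what makes a verifiable reference answer useful for RL training.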
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Mathematical Reasoning | OlympiadBench Math | Accuracy | 60.2 | 84 |
| Mathematical Reasoning | Omni-MATH | Accuracy | 45.4 | 68 |
| Mathematical Reasoning | HMMT 2025 | Accuracy | 11.7 | 38 |
| Mathematical Reasoning | AIME 2025 | Accuracy | 31.7 | 37 |
| Mathematical Problem Solving | IneqMath (IM) | Exact Match Accuracy | 76 | 12 |
| Mathematical Problem Solving | Putnam-Axiom (PA) | Exact Match Acc | 39.1 | 12 |
| Mathematical Problem Solving | MATH-Perturb MP-hard | Exact Match Accuracy | 53 | 12 |
| Mathematical Problem Solving | MATH-Perturb MP-simple | Exact Match Accuracy | 72 | 12 |
| Mathematical Problem Solving | TheoremQA TQ-Math | Exact Match Accuracy | 55.4 | 12 |
| Lemma Judging | NaturalProofs (test) | Exact Match Accuracy | 60.8 | 12 |