DeepMath-103K: A Large-Scale, Challenging, Decontaminated, and Verifiable Mathematical Dataset for Advancing Reasoning
About
Reinforcement learning (RL) with large language models shows promise in complex reasoning. However, its progress is hindered by the lack of large-scale training data that is sufficiently challenging, contamination-free, and verifiable. To this end, we introduce DeepMath-103K, a large-scale mathematical dataset designed with high difficulty (primarily levels 5-9), rigorous decontamination against numerous benchmarks, and verifiable answers for rule-based RL reward. It further includes three distinct R1 solutions adaptable for diverse training paradigms such as supervised fine-tuning (SFT). Spanning a wide range of mathematical topics, DeepMath-103K fosters the development of generalizable, advanced reasoning. Notably, models trained on DeepMath-103K achieve state-of-the-art results on challenging mathematical benchmarks and generalize beyond math to domains such as biology, physics, and chemistry, underscoring the dataset's broad efficacy. Data: https://huggingface.co/datasets/zwhe99/DeepMath-103K.
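To make "verifiable answers for rule-based RL reward" concrete, the following is a minimal sketch of what such a reward could look like. It is not the authors' verifier: it assumes that the model writes its final answer inside a LaTeX `\boxed{...}` and that a normalized exact string match against the reference answer is sufficient. The function names `extract_boxed`, `normalize`, and `rule_based_reward` are hypothetical and chosen for illustration only.

```python
import re
from typing import Optional


def extract_boxed(text: str) -> Optional[str]:
    """Return the contents of the last \\boxed{...} in text, or None.

    Assumption: the final answer is wrapped in \\boxed{...}.
    Braces are matched manually so nested LaTeX such as \\frac{1}{2} survives.
    """
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    depth = 1
    chars = []
    for ch in text[start + len(marker):]:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(chars)
        chars.append(ch)
    return None  # braces never closed


def normalize(answer: str) -> str:
    """Light normalization (an assumption): drop whitespace, $ signs, trailing period."""
    answer = answer.strip().strip("$")
    answer = re.sub(r"\s+", "", answer)
    return answer.rstrip(".")


def rule_based_reward(response: str, reference: str) -> float:
    """Binary reward: 1.0 if the boxed answer matches the reference, else 0.0."""
    predicted = extract_boxed(response)
    if predicted is None:
        return 0.0
    return 1.0 if normalize(predicted) == normalize(reference) else 0.0


if __name__ == "__main__":
    response = "Integrating gives the result \\boxed{\\frac{1}{2}}."
    print(rule_based_reward(response, "\\frac{1}{2}"))
```

An exact string match is deliberately strict: it would score mathematically equivalent forms such as `0.5` and `\frac{1}{2}` as different. A practical verifier would likely add symbolic equivalence checking, which this sketch omits.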
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Mathematical Reasoning | AIME 2024 | Accuracy | 34.2 | 479 |
| Mathematical Reasoning | AIME 2025 | Accuracy | 30 | 311 |
| Mathematical Reasoning | HMMT 2025 | Accuracy | 11.7 | 194 |
| Mathematical Reasoning | Omni-MATH | Accuracy | 45.4 | 123 |
| Mathematical Reasoning | MATH 500 | Accuracy | 83.4 | 116 |
| Mathematical Reasoning | Minerva Math | pass@1 Accuracy | 45.4 | 104 |
| Mathematical Reasoning | OlympiadBench Math | Accuracy | 60.2 | 84 |
| Mathematical Reasoning | AMC 2023 | Pass@1 | 64.7 | 67 |
| Mathematical Reasoning | AIME 2025 | Accuracy | 31.7 | 59 |
| Mathematical Reasoning | Math Reasoning Suite Average | Average Accuracy | 25.1 | 49 |