Light-R1: Curriculum SFT, DPO and RL for Long COT from Scratch and Beyond
About
This paper introduces Light-R1, an open-source suite for training long reasoning models with a reproducible and cost-effective methodology. Because the data used in the DeepSeek-R1 series is proprietary, we develop an alternative approach that relies exclusively on public data and models. Our curriculum training progressively increases data difficulty and is combined with multi-stage post-training. Our Light-R1-32B model, trained from Qwen2.5-32B-Instruct, outperforms DeepSeek-R1-Distill-Qwen-32B in math reasoning. Experimental results show that this curriculum approach becomes more effective when distinct, diverse datasets are available for different training stages: fine-tuning DeepSeek-R1-Distilled models (pre-tuned by the DeepSeek team on proprietary data) with 3,000 challenging examples from our curriculum dataset yielded state-of-the-art 7B and 14B models, while the 32B model, Light-R1-32B-DS, performed comparably to QwQ-32B and DeepSeek-R1. Furthermore, we extend our work by applying GRPO to long reasoning models. Our final Light-R1-14B-DS achieves SOTA performance among 14B models in math, with AIME24 and AIME25 scores of 74.0 and 60.2 respectively, surpassing many 32B models and DeepSeek-R1-Distill-Llama-70B. Despite math-focused training, Light-R1-14B-DS demonstrates strong cross-domain generalization. Light-R1 represents a significant advancement in making sophisticated reasoning models more accessible and implementable in real-world applications. Our models, training data and code have been made available at https://github.com/Qihoo360/Light-R1.
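The abstract mentions GRPO without describing it. As a rough illustration only, the core idea of GRPO (group relative policy optimization) is to score each sampled response against the other responses drawn for the same prompt instead of against a learned value function. The sketch below shows just that normalization step. The function name, the binary correctness rewards, and the `eps` constant are assumptions for illustration, not taken from the Light-R1 code.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of responses sampled for the same prompt.

    Each advantage is (reward - group mean) / (group std + eps).
    eps is a hypothetical stabilizer for groups where all rewards are equal.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption, implementations vary
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four sampled answers to one math problem:
# 1.0 if the final answer is correct, 0.0 otherwise.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Correct answers receive positive advantages and incorrect ones negative, so a group in which every answer has the same reward contributes no learning signal.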
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Mathematical Reasoning | OlympiadBench Math | Accuracy | 69.7 | 84 |
| Mathematical Reasoning | Omni-MATH | Accuracy | 48.5 | 68 |
| Mathematical Reasoning | HMMT 2025 | Accuracy | 25 | 38 |
| Mathematical Reasoning | AIME 2025 | Accuracy | 31.3 | 37 |
| Multi-domain language model evaluation | ODA benchmark suite (test) | General Accuracy | 64.9 | 21 |
| Reasoning | Reasoning domain benchmarks ARC-C, BBH, GPQA, CALM, KOR-BENCH | ARC-C Score | 92.2 | 16 |
| General Language Understanding and Reasoning | General domain benchmarks (test) | DROP Score | 83.4 | 16 |
| Mathematical Reasoning | Math domain benchmarks (GSM8K, MATH500, Omni-Math, Olympiad, AIME'24) standard (test) | GSM8K Accuracy | 93.8 | 16 |