Open-Reasoner-Zero: An Open Source Approach to Scaling Up Reinforcement Learning on the Base Model
About
We introduce Open-Reasoner-Zero, the first open source implementation of large-scale reasoning-oriented RL training on the base model focusing on scalability, simplicity and accessibility. Through extensive experiments, we demonstrate that a minimalist approach, vanilla PPO with GAE ($\lambda=1$, $\gamma=1$) and straightforward rule-based rewards, without any KL regularization, is sufficient to scale up both benchmark performance and response length, replicating the scaling phenomenon observed in DeepSeek-R1-Zero. Using the same base model, Qwen2.5-32B base, as DeepSeek-R1-Zero-Qwen-32B, our implementation achieves superior performance across AIME2024, MATH500, and GPQA Diamond, while demonstrating remarkable efficiency, requiring only 1/10 of the training steps compared to the DeepSeek-R1-Zero pipeline. Moreover, our analysis not only covers training dynamics and ablation for critical design choices, but also quantitatively shows how the learned critic in Reasoner-Zero training effectively identifies and devalues repetitive response patterns, yielding more robust advantage estimations and enhancing training stability. Embracing the principles of open-source, we release our source code, training data, and various model weights, fostering reproducibility and encouraging further exploration of the properties of related models.
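To make the "vanilla PPO with GAE ($\lambda=1$, $\gamma=1$)" setting concrete, the sketch below computes GAE advantages for a single response. With both parameters set to 1, the estimator reduces to the undiscounted return-to-go minus the critic's value, $A_t = \sum_{k \ge t} r_k - V(s_t)$. The function names, the exact-match reward rule, and the assumption that reward arrives only on the final token are illustrative choices, not details taken from the released code.

```python
from typing import List


def rule_based_reward(predicted: str, reference: str) -> float:
    """Hypothetical rule-based reward: 1.0 on exact answer match, else 0.0.

    The abstract only says rewards are rule-based; exact string match
    after whitespace stripping is an assumption made for illustration.
    """
    return 1.0 if predicted.strip() == reference.strip() else 0.0


def gae_advantages(
    rewards: List[float],
    values: List[float],
    gamma: float = 1.0,
    lam: float = 1.0,
) -> List[float]:
    """Generalized Advantage Estimation over one finished trajectory.

    rewards[t] is the reward received after step t, values[t] is the
    critic's estimate V(s_t). The value after the last step is taken
    to be 0 because the episode has ended.
    """
    advantages = [0.0] * len(rewards)
    next_value = 0.0  # V of the terminal state
    running = 0.0     # accumulated GAE term from later steps
    for t in reversed(range(len(rewards))):
        # One-step TD error at step t.
        delta = rewards[t] + gamma * next_value - values[t]
        # Exponentially weighted sum of TD errors.
        running = delta + gamma * lam * running
        advantages[t] = running
        next_value = values[t]
    return advantages


# A 4-token response whose final answer is correct: reward only at the end.
rewards = [0.0, 0.0, 0.0, rule_based_reward(" 42 ", "42")]
values = [0.5, 0.25, 0.75, 0.5]  # made-up critic outputs
advantages = gae_advantages(rewards, values)
# With gamma = lam = 1 each advantage equals (final reward - V(s_t)).
```

In this setting every token's advantage is the final reward minus the critic's estimate at that token, so a token the critic values highly gets a smaller advantage and one it values poorly gets a larger one.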
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Mathematical Reasoning | MATH 500 | Accuracy (Acc) | 82.4 | 543 |
| Mathematical Reasoning | AIME 2024 | Accuracy | 16.7 | 479 |
| Mathematical Reasoning | MATH 500 | Top-1 Accuracy | 82.4 | 384 |
| Mathematical Reasoning | AMC | Accuracy (%) | 57.5 | 368 |
| Mathematical Reasoning | AIME 24 | Accuracy | 16.5 | 318 |
| Mathematical Reasoning | Minerva | Pass@1 Accuracy | 33.1 | 289 |
| Mathematical Reasoning | MATH 500 | Pass@1 Rate | 80.8 | 236 |
| Mathematical Reasoning | AIME 2024 | Pass@1 Accuracy | 13.3 | 236 |
| Mathematical Reasoning | Minerva Math | Accuracy | 34.2 | 228 |
| Mathematical Reasoning | Olympiad Bench | Accuracy | 45.6 | 222 |