
Open-Reasoner-Zero: An Open Source Approach to Scaling Up Reinforcement Learning on the Base Model

About

We introduce Open-Reasoner-Zero, the first open-source implementation of large-scale reasoning-oriented RL training on the base model, focusing on scalability, simplicity, and accessibility. Through extensive experiments, we demonstrate that a minimalist approach, vanilla PPO with GAE ($\lambda=1$, $\gamma=1$) and straightforward rule-based rewards, without any KL regularization, is sufficient to scale up both benchmark performance and response length, replicating the scaling phenomenon observed in DeepSeek-R1-Zero. Using the same base model, Qwen2.5-32B base, as DeepSeek-R1-Zero-Qwen-32B, our implementation achieves superior performance across AIME2024, MATH500, and GPQA Diamond, while demonstrating remarkable efficiency, requiring only 1/10 of the training steps of the DeepSeek-R1-Zero pipeline. Moreover, our analysis not only covers training dynamics and ablations of critical design choices, but also quantitatively shows how the learned critic in Reasoner-Zero training effectively identifies and devalues repetitive response patterns, yielding more robust advantage estimation and enhancing training stability. Embracing the principles of open source, we release our source code, training data, and various model weights, fostering reproducibility and encouraging further exploration of the properties of related models.
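The two design choices the abstract singles out, a rule-based reward with no learned reward model and vanilla GAE with $\lambda=1$, $\gamma=1$, can be sketched in a few lines. This is an illustrative sketch, not the authors' released code; the function names, the `\boxed{...}` answer convention, and the example values are assumptions for demonstration.

```python
# Sketch (not the authors' implementation) of the minimalist recipe:
# a rule-based reward plus GAE with gamma = 1, lambda = 1.

def rule_based_reward(response: str, reference_answer: str) -> float:
    """Rule-based reward: 1.0 if the extracted final answer matches the
    reference, else 0.0. No learned reward model is involved.
    Assumes (hypothetically) the answer appears inside \\boxed{...}."""
    start = response.rfind("\\boxed{")
    if start == -1:
        return 0.0
    end = response.find("}", start)
    answer = response[start + len("\\boxed{"):end]
    return 1.0 if answer.strip() == reference_answer.strip() else 0.0


def gae_advantages(rewards, values, gamma=1.0, lam=1.0):
    """Generalized Advantage Estimation. With gamma = lambda = 1 this
    reduces to A_t = (sum of future rewards) - V(s_t): the Monte Carlo
    return minus the critic's value estimate."""
    advantages = [0.0] * len(rewards)
    gae = 0.0
    for t in reversed(range(len(rewards))):
        next_value = values[t + 1] if t + 1 < len(rewards) else 0.0
        delta = rewards[t] + gamma * next_value - values[t]
        gae = delta + gamma * lam * gae
        advantages[t] = gae
    return advantages


# With a sparse terminal reward of 1.0 and gamma = lambda = 1, every
# token's advantage is 1.0 - V(s_t): the critic alone shapes credit
# assignment along the trajectory (values here are made-up examples).
rewards = [0.0, 0.0, 1.0]
values = [0.4, 0.6, 0.8]
print(gae_advantages(rewards, values))  # approx [0.6, 0.4, 0.2]
```

Note how, with these settings, there is no exponential discounting or bias-variance trade-off knob left: the advantage is simply the outcome reward minus the critic's prediction, which is consistent with the paper's emphasis on the critic's role in stabilizing training.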

Jingcheng Hu, Yinmin Zhang, Qi Han, Daxin Jiang, Xiangyu Zhang, Heung-Yeung Shum • 2025

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Mathematical Reasoning | AMC | Accuracy: 54.2 | 151 |
| Mathematical Reasoning | Minerva | – | 138 |
| Mathematical Reasoning | AIME 24 | Accuracy: 13.3 | 113 |
| Mathematical Reasoning | MATH 500 | MATH 500 Accuracy: 82.4 | 106 |
| Mathematical Reasoning | OlympiadBench | Accuracy: 0.479 | 34 |
| Mathematical Reasoning | In-Distribution Reasoning Performance Suite (AIME, AMC, MATH-500, Minerva, Olympiad) | AIME 2024 Score: 16.5 | 30 |
| Reasoning | Out-of-Domain Reasoning Suite | ARC-c Score: 66.2 | 29 |
| Mathematical Reasoning | Olympiad | Accuracy (%): 47.9 | 21 |
| Out-of-Distribution Reasoning | OOD Reasoning Datasets (MMLU-S, GPQA, ARC, BBH) (test) | GPQA: 29.8 | 20 |
| General Domain Reasoning | General Domain Reasoning benchmarks (ARC-c, MMLU-Pro) | ARC-c: 66.2 | 12 |

Showing 10 of 13 rows.
