Understanding R1-Zero-Like Training: A Critical Perspective

About

DeepSeek-R1-Zero has shown that reinforcement learning (RL) at scale can directly enhance the reasoning capabilities of LLMs without supervised fine-tuning. In this work, we critically examine R1-Zero-like training by analyzing its two core components: base models and RL. We investigate a wide range of base models, including DeepSeek-V3-Base, to understand how pretraining characteristics influence RL performance. Our analysis reveals that DeepSeek-V3-Base already exhibits an "Aha moment", while Qwen2.5 base models demonstrate strong reasoning capabilities even without prompt templates, suggesting potential pretraining biases. Additionally, we identify an optimization bias in Group Relative Policy Optimization (GRPO) that artificially increases response length during training, especially for incorrect outputs. To address this, we introduce Dr. GRPO, an unbiased optimization method that improves token efficiency while maintaining reasoning performance. Leveraging these insights, we present a minimalist R1-Zero recipe that achieves 43.3% accuracy on AIME 2024 with a 7B base model, establishing a new state of the art. Our code is available at https://github.com/sail-sg/understand-r1-zero.
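
The length bias mentioned above can be illustrated with a small sketch. It is not taken from the authors' repository. It assumes the commonly described form of the two objectives: GRPO divides the group-centered reward by the group standard deviation and averages the loss over each response's tokens, while Dr. GRPO drops both normalizations. The function names (`grpo_advantages`, `dr_grpo_advantages`, `per_token_weight`) and the toy rewards and lengths are hypothetical.

```python
from statistics import mean, pstdev


def grpo_advantages(rewards, eps=1e-6):
    """Group-centered rewards divided by the group std (assumed GRPO form)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def dr_grpo_advantages(rewards):
    """Group-centered rewards only, with no std division (assumed Dr. GRPO form)."""
    mu = mean(rewards)
    return [r - mu for r in rewards]


def per_token_weight(advantage, length, normalize_by_length):
    """Weight that each token of a response receives in the policy-gradient loss.

    With per-response length normalization, the advantage is spread over
    `length` tokens, so longer responses get a smaller per-token signal.
    """
    return advantage / length if normalize_by_length else advantage


# Toy group: one correct response and three incorrect ones of growing length.
rewards = [1.0, 0.0, 0.0, 0.0]
lengths = [100, 100, 400, 800]

grpo = [
    per_token_weight(a, n, normalize_by_length=True)
    for a, n in zip(grpo_advantages(rewards), lengths)
]
dr_grpo = [
    per_token_weight(a, n, normalize_by_length=False)
    for a, n in zip(dr_grpo_advantages(rewards), lengths)
]

for n, g, d in zip(lengths, grpo, dr_grpo):
    print(f"length={n:4d}  grpo_per_token={g:+.5f}  dr_grpo_per_token={d:+.5f}")
```

Under these assumptions, the three incorrect responses receive a per-token penalty that shrinks as they get longer in the GRPO-style weighting, whereas the Dr. GRPO-style weighting penalizes every token of an incorrect response equally regardless of length. That is one way to read the claim that the bias favors longer incorrect outputs.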

Zichen Liu, Changyu Chen, Wenjun Li, Penghui Qi, Tianyu Pang, Chao Du, Wee Sun Lee, Min Lin • 2025

Related benchmarks

Task	Dataset	Result
Mathematical Reasoning	MATH500 (test)	Accuracy: 73	895
Question Answering	ARC Challenge	Accuracy (ARC): 78.8	598
Mathematical Reasoning	MATH 500	Accuracy (Acc): 78	543
Mathematical Reasoning	AIME 2024	Accuracy: 40	479
Mathematical Reasoning	MATH 500	--	442
Code Generation	MBPP (test)	--	405
Mathematical Reasoning	MATH 500	Top-1 Accuracy: 89.6	384
Mathematical Reasoning	AMC	Accuracy (%): 61.2	368
Mathematical Reasoning	AIME 24	Accuracy: 33.4	318
Mathematical Reasoning	AIME 2025	Accuracy: 6.7	311

Showing 10 of 228 rows
