
JustRL: Scaling a 1.5B LLM with a Simple RL Recipe

About

Recent advances in reinforcement learning for large language models have converged on increasing complexity: multi-stage training pipelines, dynamic hyperparameter schedules, and curriculum learning strategies. This raises a fundamental question: Is this complexity necessary? We present JustRL, a minimal approach using single-stage training with fixed hyperparameters that achieves state-of-the-art performance on two 1.5B reasoning models (54.9% and 64.3% average accuracy across nine mathematical benchmarks) while using 2× less compute than sophisticated approaches. The same hyperparameters transfer across both models without tuning, and training exhibits smooth, monotonic improvement over 4,000+ steps without the collapses or plateaus that typically motivate interventions. Critically, ablations reveal that adding "standard tricks" like explicit length penalties and robust verifiers may degrade performance by collapsing exploration. These results suggest that the field may be adding complexity to solve problems that disappear with a stable, scaled-up baseline. We release our models and code to establish a simple, validated baseline for the community.

Bingxiang He, Zekai Qu, Zeyuan Liu, Yinghao Chen, Yuxin Zuo, Cheng Qian, Kaiyan Zhang, Weize Chen, Chaojun Xiao, Ganqu Cui, Ning Ding, Zhiyuan Liu • 2025

Related benchmarks

Task | Dataset | Result | Rank
Mathematical Reasoning | AIME 2025 | Accuracy 62.92 | 227
Mathematical Reasoning | AIME 25 | Accuracy 38.96 | 201
Mathematical Reasoning | AMC 23 | Accuracy 89.61 | 198
Mathematical Reasoning | Minerva | -- | 138
Mathematical Reasoning | MATH 500 | Accuracy 91.65 | 119
Mathematical Reasoning | AIME 24 | Accuracy 52.08 | 113
Mathematical Reasoning | AMC 2023 | Accuracy 96.02 | 65
Mathematical Reasoning | Olympiad | Accuracy 68.81 | 50
Mathematical Reasoning | OlympiadBench | Accuracy 0.7659 | 34
Mathematical Reasoning | AIME 2024 | Accuracy 69.69 | 25
Showing 10 of 16 rows
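
The accuracy values listed above are not all on the same scale: most look like percentages, while the OlympiadBench entry (0.7659) looks like a fraction. Below is a minimal Python sketch for putting them on one scale before comparing them. It rests on an assumption that the listing does not confirm: any value of at most 1.0 is treated as a fraction, and anything larger as a percentage. The names `ROWS` and `to_percent` are hypothetical helpers for illustration. The Minerva row is left out because no value is shown for it.

```python
# Rows copied from the benchmark listing: (dataset, reported accuracy).
# Minerva is omitted because the listing shows no value for it.
ROWS = [
    ("AIME 2025", 62.92),
    ("AIME 25", 38.96),
    ("AMC 23", 89.61),
    ("MATH 500", 91.65),
    ("AIME 24", 52.08),
    ("AMC 2023", 96.02),
    ("Olympiad", 68.81),
    ("OlympiadBench", 0.7659),
    ("AIME 2024", 69.69),
]


def to_percent(value: float) -> float:
    """Return the value as a percentage.

    Assumption (not stated by the listing): a value of at most 1.0
    is a fraction; anything larger is already a percentage.
    """
    return value * 100.0 if value <= 1.0 else value


# Normalize every row to a percentage, rounded to two decimals.
normalized = {name: round(to_percent(value), 2) for name, value in ROWS}

for name, pct in normalized.items():
    print(f"{name:<14} {pct:6.2f}%")
```

Rows with similar dataset names (for example "AIME 2025" and "AIME 25") are kept separate here, since the listing reports them as distinct entries with different results and ranks.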
