OctoThinker: Mid-training Incentivizes Reinforcement Learning Scaling

About

Different base language model families, such as Llama and Qwen, exhibit divergent behaviors during post-training with reinforcement learning (RL), especially on reasoning-intensive tasks. What makes a base language model suitable for reinforcement learning? Gaining deeper insight into this question is essential for developing RL-scalable foundation models of the next generation. In this work, we investigate how mid-training strategies shape RL dynamics, focusing on two representative model families: Qwen and Llama. Our study reveals that (1) high-quality mathematical corpora, such as MegaMath-Web-Pro, significantly improve both base model and RL performance, while existing alternatives (e.g., FineMath-4plus) fail to do so; (2) further adding QA-style data, particularly long chain-of-thought (CoT) reasoning examples, enhances RL outcomes, and instruction data further unlocks this effect; (3) while long-CoT improves reasoning depth, it can also induce verbosity of model responses and unstability of RL training, underscoring the importance of data formatting; (4) scaling mid-training consistently leads to stronger downstream RL performance. Building on these insights, we introduce a two-stage mid-training strategy, Stable-then-Decay, in which base models are first trained on 200B tokens with a constant learning rate, followed by 20B tokens across three CoT-focused branches with learning rate decay. This yields OctoThinker, a family of models demonstrating strong RL compatibility and closing the performance gap with more RL-friendly model families, i.e., Qwen. We hope our work will help shape pre-training strategies for foundation models in the RL era. To support further research, we release our open-source models along with a curated math reasoning-intensive corpus of over 70 billion tokens (i.e., MegaMath-Web-Pro-Max).

Zengzhi Wang, Fan Zhou, Xuefeng Li, Pengfei Liu• 2025

Related benchmarks

Task	Dataset	Result
Mathematical Reasoning	MATH500 (test)	Accuracy65.6	895
Mathematical Reasoning	AMC 2023	Accuracy33.5	144
Mathematical Reasoning	Olympiad	Accuracy26.6	137
Mathematical Reasoning	Minerva	Accuracy (%)25.7	67
Mathematical Reasoning	MATH500	Accuracy (%)65.6	47
Mathematical Reasoning	Minerva	Avg@1613.3	43
Graduate-Level Reasoning	GPQA Diamond	Accuracy22.1	40
Math Reasoning	Olympiad	Average Rate @1613.2	38
Mathematical Reasoning	OlympiadBench	EM26.6	36
Mathematical Reasoning	AIME 2025	Accuracy (%)0.5	17

Showing 10 of 16 rows

Other info

Follow for update

@wizwand_team Discord