SFT-then-RL Outperforms Mixed-Policy Methods for LLM Reasoning

About

Recent mixed-policy optimization methods for LLM reasoning, which interleave or blend supervised and reinforcement learning signals, report improvements over the standard SFT-then-RL pipeline. We show that numerous recently published papers rely on a faulty baseline caused by two distinct bugs: a CPU-offloaded optimizer bug in DeepSpeed that silently drops intermediate micro-batches during gradient accumulation (affecting multiple downstream frameworks, including TRL, OpenRLHF, and Llama-Factory), and a loss aggregation bug in OpenRLHF that incorrectly weights per-mini-batch losses. Together, these bugs suppress SFT performance, with the optimizer bug accounting for most of the gap and the loss aggregation bug contributing a smaller additional effect. Once both bugs are corrected, the standard SFT-then-RL pipeline surpasses every published mixed-policy method we evaluate, by +3.8 points on math benchmarks with Qwen2.5-Math-7B and by +22.2 points with Llama-3.1-8B. Even a truncated variant with just 50 RL steps outperforms mixed-policy methods on math benchmarks while using fewer FLOPs.
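To make the two failure modes more concrete, here is a minimal toy sketch in plain Python. It is not code from DeepSpeed or OpenRLHF, and the exact form of either bug is an assumption: the loss example assumes the mis-weighting resembles averaging per-micro-batch means instead of averaging over all tokens, and the gradient example assumes that "dropping intermediate micro-batches" behaves like keeping only the last micro-batch's gradient. All function names and numbers are hypothetical.

```python
# Toy illustration only: hypothetical helpers and numbers,
# not the actual DeepSpeed or OpenRLHF code paths.

def token_weighted_mean(micro_batches):
    """Mean loss over every token in the accumulated batch."""
    total_loss = sum(sum(mb) for mb in micro_batches)
    total_tokens = sum(len(mb) for mb in micro_batches)
    return total_loss / total_tokens


def mean_of_means(micro_batches):
    """Average of per-micro-batch means.

    Each micro-batch counts equally regardless of how many tokens it
    holds, so short micro-batches are over-weighted.
    """
    per_batch = [sum(mb) / len(mb) for mb in micro_batches]
    return sum(per_batch) / len(per_batch)


def accumulate(grads, drop_intermediate=False):
    """Sum per-micro-batch gradients.

    With drop_intermediate=True only the last micro-batch survives,
    which is our assumed model of silently dropped micro-batches.
    """
    kept = grads[-1:] if drop_intermediate else grads
    return sum(kept)


# Per-token losses for two micro-batches of unequal length.
micro_batches = [[1.0, 1.0, 1.0, 1.0], [3.0]]
print(token_weighted_mean(micro_batches))  # 1.4
print(mean_of_means(micro_batches))        # 2.0

# Scalar stand-ins for per-micro-batch gradients.
grads = [0.5, 0.25, 0.25]
print(accumulate(grads))                          # 1.0
print(accumulate(grads, drop_intermediate=True))  # 0.25
```

In this toy setting the mis-weighted loss differs from the token-weighted one whenever micro-batches have unequal token counts, and the dropped-gradient update uses only a fraction of the data the accumulation step was supposed to see.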

Alexis Limozin, Eduard Durech, Torsten Hoefler, Imanol Schlag, Valentina Pyatkin • 2026

Related benchmarks

Task	Dataset	Metric	Result
Mathematical Reasoning	MATH 500	Accuracy (Acc)	78.6	600
Mathematical Reasoning	AMC	Accuracy (%)	53.9	375
Mathematical Reasoning	Minerva Math	Accuracy	20.7	251
Scientific Reasoning	ARC Challenge	--	--	121
Mathematical Reasoning	MATH 500	Pass@1	92	68
Mathematical Reasoning	Minerva	pass@1 Mean	43.9	54
Mathematical Reasoning	Olympiad Math	Accuracy	48.5	35
Mathematical Reasoning	OlympiadBench	Pass@1	62.8	33
General Knowledge	MMLU-Pro	pass@1	55.1	20
Mathematical Reasoning	AIME 24	Average Score (Top-32)	40.4	20

Showing 10 of 14 rows
