SFT-then-RL Outperforms Mixed-Policy Methods for LLM Reasoning

About

Recent mixed-policy optimization methods for LLM reasoning that interleave or blend supervised and reinforcement learning signals report improvements over the standard SFT-then-RL pipeline. We show that numerous recently published research papers rely on a faulty baseline caused by two distinct bugs: a CPU-offloaded optimizer bug in DeepSpeed that silently drops intermediate micro-batches during gradient accumulation (affecting multiple downstream frameworks including TRL, OpenRLHF and Llama-Factory), and a loss aggregation bug in OpenRLHF that incorrectly weights per-mini-batch losses. Together they suppress SFT performance, with the optimizer bug accounting for most of the gap and the loss aggregation bug contributing a smaller additional effect. Once corrected, the standard SFT-then-RL pipeline surpasses every published mixed-policy method we evaluate by +3.8 points on math benchmarks with Qwen2.5-Math-7B and by +22.2 points with Llama-3.1-8B. Even a truncated variant with just 50 RL steps outperforms mixed-policy methods on math benchmarks while using fewer FLOPs.

Alexis Limozin, Eduard Durech, Torsten Hoefler, Imanol Schlag, Valentina Pyatkin • 2026
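
To make the two failure modes named in the abstract concrete, here is a minimal toy sketch in plain Python. It is not the DeepSpeed or OpenRLHF code, and the abstract does not give the exact mechanics of either bug. The function names (`accumulate`, `mean_of_means`, `token_weighted_mean`) and the scalar "gradients" are hypothetical, chosen only to illustrate how dropping intermediate micro-batches, or averaging per-batch means instead of weighting by token count, can change the training signal.

```python
def accumulate(grads, drop_intermediate=False):
    """Average scalar micro-batch gradients over one accumulation window.

    With drop_intermediate=True, only the last micro-batch contributes,
    a simplified stand-in (assumption) for silently lost micro-batches.
    """
    total = 0.0
    last = len(grads) - 1
    for i, g in enumerate(grads):
        if drop_intermediate and i != last:
            continue  # simulated bug: this micro-batch is never applied
        total += g / len(grads)
    return total


def mean_of_means(batches):
    """Average the per-batch mean losses, ignoring how many tokens each holds."""
    return sum(sum(b) / len(b) for b in batches) / len(batches)


def token_weighted_mean(batches):
    """Average the loss over all tokens, so larger batches weigh more."""
    total_loss = sum(sum(b) for b in batches)
    total_tokens = sum(len(b) for b in batches)
    return total_loss / total_tokens


# Hypothetical numbers, for illustration only.
grads = [1.0, 2.0, 3.0, 6.0]
correct_grad = accumulate(grads)
buggy_grad = accumulate(grads, drop_intermediate=True)

batches = [[1.0, 1.0, 1.0, 1.0], [5.0]]  # per-token losses in two batches
unweighted = mean_of_means(batches)
weighted = token_weighted_mean(batches)
```

In this toy setup the accumulated gradient shrinks when intermediate micro-batches are skipped, and the two loss aggregations disagree whenever batches contain different numbers of tokens. Which aggregation a given framework intends is an assumption here, not something the abstract specifies.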

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Mathematical Reasoning | MATH 500 | Accuracy (Acc) | 78.6 | 543 |
| Mathematical Reasoning | AMC | Accuracy (%) | 53.9 | 368 |
| Mathematical Reasoning | Minerva Math | Accuracy | 20.7 | 233 |
| Scientific Reasoning | ARC Challenge | -- | -- | 115 |
| Mathematical Reasoning | MATH 500 | Pass@1 | 92 | 68 |
| Mathematical Reasoning | Minerva | pass@1 Mean | 43.9 | 54 |
| Mathematical Reasoning | Olympiad Math | Accuracy | 48.5 | 35 |
| Mathematical Reasoning | OlympiadBench | Pass@1 | 62.8 | 33 |
| General Knowledge | MMLU-Pro | pass@1 | 55.1 | 20 |
| General Knowledge | GPQA Diamond | Pass@1 | 40.6 | 17 |
