SFT-then-RL Outperforms Mixed-Policy Methods for LLM Reasoning
About
Recent mixed-policy optimization methods for LLM reasoning that interleave or blend supervised and reinforcement learning signals report improvements over the standard SFT-then-RL pipeline. We show that numerous recently published research papers rely on a faulty baseline caused by two distinct bugs: a CPU-offloaded optimizer bug in DeepSpeed that silently drops intermediate micro-batches during gradient accumulation (affecting multiple downstream frameworks including TRL, OpenRLHF and Llama-Factory), and a loss aggregation bug in OpenRLHF that incorrectly weights per-mini-batch losses. Together they suppress SFT performance, with the optimizer bug accounting for most of the gap and the loss aggregation bug contributing a smaller additional effect. Once corrected, the standard SFT-then-RL pipeline surpasses every published mixed-policy method we evaluate by +3.8 points on math benchmarks with Qwen2.5-Math-7B and by +22.2 points with Llama-3.1-8B. Even a truncated variant with just 50 RL steps outperforms mixed-policy methods on math benchmarks while using fewer FLOPs.
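The two failure modes named in the abstract can be illustrated with a toy sketch. This is not code from DeepSpeed, OpenRLHF, or the paper. It assumes, purely for illustration, that the loss aggregation issue takes the form of averaging per-micro-batch means with equal weight instead of weighting by token count, and that the optimizer issue leaves only the final micro-batch's gradient in the update. All function names and numbers are hypothetical.

```python
# Toy illustration only: not code from DeepSpeed, OpenRLHF, or the paper.
# Both "buggy" variants are assumed failure modes chosen for illustration.

def token_weighted_mean(micro_batches):
    """Reference aggregation: mean loss over every token in the full batch."""
    total_loss = sum(sum(mb) for mb in micro_batches)
    total_tokens = sum(len(mb) for mb in micro_batches)
    return total_loss / total_tokens


def mean_of_means(micro_batches):
    """Assumed mis-weighting: each micro-batch mean counts equally,
    so tokens in short micro-batches are over-weighted."""
    per_batch = [sum(mb) / len(mb) for mb in micro_batches]
    return sum(per_batch) / len(per_batch)


def accumulated_gradient(micro_grads, drop_intermediate=False):
    """Average scalar gradient over the accumulation window.

    With drop_intermediate=True, only the last micro-batch contributes
    (an assumed model of silently dropped micro-batches), while the
    normalizer still counts every micro-batch.
    """
    kept = micro_grads[-1:] if drop_intermediate else micro_grads
    return sum(kept) / len(micro_grads)


# Hypothetical per-token losses: one long micro-batch, one short one.
token_losses = [[1.0, 1.0, 1.0, 1.0], [3.0]]
# Hypothetical per-micro-batch gradients for a single scalar parameter.
micro_grads = [0.5, -0.25, 1.0, 0.75]

print("token-weighted loss:", token_weighted_mean(token_losses))
print("mean-of-means loss: ", mean_of_means(token_losses))
print("full accumulation:  ", accumulated_gradient(micro_grads))
print("with dropped steps: ", accumulated_gradient(micro_grads, drop_intermediate=True))
```

In this toy setup the token-weighted loss is 1.4 while the mean of means is 2.0, and the accumulated gradient falls from 0.5 to 0.1875 when only the last micro-batch survives. Neither variant raises an error, which is consistent with the abstract's description of a silent failure.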
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Mathematical Reasoning | MATH 500 | Accuracy | 78.6 | 543 |
| Mathematical Reasoning | AMC | Accuracy (%) | 53.9 | 368 |
| Mathematical Reasoning | Minerva Math | Accuracy | 20.7 | 233 |
| Scientific Reasoning | ARC Challenge | -- | -- | 115 |
| Mathematical Reasoning | MATH 500 | Pass@1 | 92 | 68 |
| Mathematical Reasoning | Minerva | Pass@1 (mean) | 43.9 | 54 |
| Mathematical Reasoning | Olympiad Math | Accuracy | 48.5 | 35 |
| Mathematical Reasoning | OlympiadBench | Pass@1 | 62.8 | 33 |
| General Knowledge | MMLU-Pro | Pass@1 | 55.1 | 20 |
| General Knowledge | GPQA Diamond | Pass@1 | 40.6 | 17 |