Learning to Reason under Off-Policy Guidance
About
Recent advances in large reasoning models (LRMs) demonstrate that sophisticated behaviors such as multi-step reasoning and self-reflection can emerge via reinforcement learning with verifiable rewards (*RLVR*). However, existing *RLVR* approaches are inherently "on-policy", limiting learning to a model's own outputs and failing to acquire reasoning abilities beyond its initial capabilities. To address this issue, we introduce **LUFFY** (**L**earning to reason **U**nder o**FF**-polic**Y** guidance), a framework that augments *RLVR* with off-policy reasoning traces. LUFFY dynamically balances imitation and exploration by combining off-policy demonstrations with on-policy rollouts during training. Specifically, LUFFY combines the Mixed-Policy GRPO framework, which has a theoretically guaranteed convergence rate, with policy shaping via regularized importance sampling to avoid superficial and rigid imitation during mixed-policy training. Compared with previous RLVR methods, LUFFY achieves an average gain of over **+6.4** points across six math benchmarks and an advantage of over **+6.2** points on out-of-distribution tasks. Most significantly, we show that LUFFY successfully trains weak models in scenarios where on-policy RLVR completely fails. These results provide compelling evidence that LUFFY transcends the fundamental limitations of on-policy RLVR and demonstrates the great potential of utilizing off-policy guidance in RLVR.
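The mixed-policy objective described above can be sketched as follows. This is a minimal, sequence-level illustration, not the paper's exact implementation: the shaping function `f(r) = r / (r + gamma)`, the hyperparameter `gamma`, and the sequence-level (rather than token-level) formulation are all assumptions made for clarity. The key idea it demonstrates is that on-policy rollouts and off-policy demonstrations share one group-relative advantage, while the off-policy term passes through a regularized importance ratio so that low-probability demonstration tokens still contribute gradient instead of being rigidly imitated or ignored.

```python
import numpy as np

def shaped_ratio(logp_new, logp_behav, gamma=0.1):
    """Regularized importance ratio f(r) = r / (r + gamma).

    Illustrative shaping function (an assumption, not necessarily the
    paper's exact form): it bounds the ratio in (0, 1), damping the
    gradient pressure toward verbatim imitation of off-policy traces.
    """
    r = np.exp(logp_new - logp_behav)
    return r / (r + gamma)

def mixed_policy_grpo_loss(logp_new_on, logp_old_on, rew_on,
                           logp_new_off, logp_behav_off, rew_off,
                           gamma=0.1):
    """Sequence-level sketch of a mixed-policy GRPO surrogate loss."""
    # Group-relative advantage: normalize rewards over the mixed group
    # of on-policy rollouts and off-policy demonstrations together.
    rewards = np.concatenate([rew_on, rew_off])
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    adv_on, adv_off = adv[:len(rew_on)], adv[len(rew_on):]
    # On-policy rollouts use the standard importance ratio.
    ratio_on = np.exp(logp_new_on - logp_old_on)
    # Off-policy demonstrations use the shaped ratio (policy shaping).
    ratio_off = shaped_ratio(logp_new_off, logp_behav_off, gamma)
    # Policy-gradient surrogate: maximize advantage-weighted ratios.
    return -np.concatenate([ratio_on * adv_on, ratio_off * adv_off]).mean()
```

In practice the real objective operates at the token level with per-token log-probabilities; the sequence-level version here keeps the group-normalization and shaping steps visible in a few lines.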
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Mathematical Reasoning | AMC 23 | Accuracy | 72.27 | 198 |
| Mathematical Reasoning | Minerva | -- | -- | 138 |
| Mathematical Reasoning | Olympiad Bench | Pass@1 Accuracy | 43.4 | 115 |
| Mathematical Reasoning | Olympiad | Accuracy | 51.61 | 92 |
| Function Calling | BFCL V3 | Overall Accuracy | 49.23 | 88 |
| Mathematical Reasoning | Minerva Math | Pass@1 Accuracy | 39 | 82 |
| Mathematical Reasoning | MATH 500 | Accuracy | 83.48 | 73 |
| Mathematical Reasoning | Mathematical Reasoning Suite (AMC, AIME 2024, AIME 2025, Minerva, MATH, Olympiad) | Average Score | 21.1 | 55 |
| Mathematical Reasoning | AMC 23 | Pass@1 | 61.2 | 43 |
| Mathematical Reasoning | MathVerse | -- | -- | 39 |