
Learning to Reason under Off-Policy Guidance

About

Recent advances in large reasoning models (LRMs) demonstrate that sophisticated behaviors such as multi-step reasoning and self-reflection can emerge via reinforcement learning with verifiable rewards (RLVR). However, existing RLVR approaches are inherently "on-policy": learning is limited to the model's own outputs, so the model cannot acquire reasoning abilities beyond its initial capabilities. To address this issue, we introduce LUFFY (Learning to reason Under oFF-policY guidance), a framework that augments RLVR with off-policy reasoning traces. LUFFY dynamically balances imitation and exploration by combining off-policy demonstrations with on-policy rollouts during training. Specifically, LUFFY combines the Mixed-Policy GRPO framework, which has a theoretically guaranteed convergence rate, with policy shaping via regularized importance sampling to avoid superficial and rigid imitation during mixed-policy training. Compared with previous RLVR methods, LUFFY achieves an average gain of over +6.4 points across six math benchmarks and an advantage of over +6.2 points on out-of-distribution tasks. Most significantly, LUFFY successfully trains weak models in scenarios where on-policy RLVR fails entirely. These results provide compelling evidence that LUFFY transcends the fundamental limitations of on-policy RLVR and demonstrate the great potential of off-policy guidance in RLVR.
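The abstract does not give the exact training objective, but the PyTorch-style sketch below illustrates how a mixed-policy GRPO update with regularized importance sampling could be wired: on-policy rollouts use the standard clipped surrogate, while tokens from off-policy demonstration traces receive a shaped importance weight so that low-probability teacher tokens still contribute gradient instead of being ignored. The function name, the shaping form f(r) = r / (r + gamma), and all constants are illustrative assumptions, not the authors' released implementation.

```python
import torch


def mixed_policy_grpo_loss(
    logp_new,        # (B, T) log-probs of the current policy on the mixed batch
    logp_behavior,   # (B, T) log-probs of the behavior policy (rollout policy
                     #        for on-policy samples, teacher for demonstrations)
    advantages,      # (B,)   group-normalized advantages from verifiable rewards
    is_off_policy,   # (B,)   bool mask: True for off-policy demonstration traces
    mask,            # (B, T) valid-token mask
    gamma=0.1,       # shaping constant for the regularized ratio (assumed value)
    clip_eps=0.2,    # PPO/GRPO clipping range (assumed value)
):
    """Hedged sketch of a GRPO-style loss over a mix of on-policy rollouts and
    off-policy reasoning traces; the shaping of the importance ratio is one
    plausible reading of "policy shaping via regularized importance sampling"."""
    ratio = torch.exp(logp_new - logp_behavior)   # token-level importance ratio
    adv = advantages.unsqueeze(-1)                # broadcast advantage over tokens

    # On-policy branch: standard clipped surrogate objective.
    on_obj = torch.minimum(
        ratio * adv,
        torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * adv,
    )

    # Off-policy branch: shaped ratio f(r) = r / (r + gamma) keeps the weight
    # (and hence the gradient) non-vanishing on low-probability teacher tokens,
    # discouraging superficial imitation of only the easy tokens.
    off_obj = (ratio / (ratio + gamma)) * adv

    obj = torch.where(is_off_policy.unsqueeze(-1), off_obj, on_obj)
    # Maximize the objective by minimizing its negative, averaged over valid tokens.
    return -(obj * mask).sum() / mask.sum()
```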

Jianhao Yan, Yafu Li, Zican Hu, Zhi Wang, Ganqu Cui, Xiaoye Qu, Yu Cheng, Yue Zhang • 2025

Related benchmarks

Task | Dataset | Metric | Result | Rank
Mathematical Reasoning | AMC 23 | Accuracy | 72.27 | 198
Mathematical Reasoning | Minerva | -- | -- | 138
Mathematical Reasoning | Olympiad Bench | Pass@1 Accuracy | 43.4 | 115
Mathematical Reasoning | Olympiad | Accuracy | 51.61 | 92
Function Calling | BFCL V3 | Overall Accuracy | 49.23 | 88
Mathematical Reasoning | Minerva Math | Pass@1 Accuracy | 39 | 82
Mathematical Reasoning | MATH 500 | Accuracy | 83.48 | 73
Mathematical Reasoning | Mathematical Reasoning Suite (AMC, AIME 2024, AIME 2025, Minerva, MATH, Olympiad), various (test/val) | Average Score | 21.1 | 55
Mathematical Reasoning | AMC23 | Pass@1 | 61.2 | 43
Mathematical Reasoning | MathVerse | -- | -- | 39
Showing 10 of 33 rows
