Scaling Reasoning Efficiently via Relaxed On-Policy Distillation
About
On-policy distillation is pivotal for transferring reasoning capabilities to capacity-constrained models, yet it remains prone to instability and negative transfer. We show that on-policy distillation can be interpreted, both theoretically and empirically, as a form of policy optimization in which the teacher-student log-likelihood ratio acts as a token-level reward. From this insight, we introduce REOPOLD (Relaxed On-Policy Distillation), a framework that stabilizes optimization by relaxing the strict imitation constraints of standard on-policy distillation. Specifically, REOPOLD leverages teacher rewards in a tempered and selective manner through mixture-based reward clipping, entropy-based token-level dynamic sampling, and a unified exploration-to-refinement training strategy. Empirically, REOPOLD surpasses its baselines with superior sample efficiency during training and enhanced test-time scaling at inference across mathematical, visual, and agentic tool-use reasoning tasks. Notably, REOPOLD outperforms recent RL approaches with 6.7-12x greater sample efficiency, and it enables a 7B student to match a 32B teacher in visual reasoning with a ~3.32x inference speedup.
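To make the reward interpretation concrete, below is a minimal PyTorch sketch of the core idea: treating the teacher-student log-likelihood ratio at each sampled token as a per-token reward. The function name `reopold_token_rewards` and the parameters `mix_coef` and `keep_frac` are illustrative assumptions, and the specific forms of the mixture-based clipping and entropy-based selection shown here are guesses at the components named in the abstract, not the paper's exact formulas.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def reopold_token_rewards(
    student_logits,   # (T, V) student logits at each sampled position
    teacher_logits,   # (T, V) teacher logits at the same positions
    tokens,           # (T,) token ids actually sampled by the student
    mix_coef=0.5,     # hypothetical mixture weight for reward clipping
    keep_frac=0.5,    # hypothetical fraction of tokens kept by entropy sampling
):
    """Sketch of on-policy distillation viewed as policy optimization.

    The teacher-student log-likelihood ratio is treated as a token reward,
    per the interpretation stated in the abstract. The clipping and
    sampling rules below are illustrative, not REOPOLD's exact method.
    """
    student_logp = F.log_softmax(student_logits, dim=-1)
    teacher_logp = F.log_softmax(teacher_logits, dim=-1)

    # Mixture-based reward clipping (assumed form): score the sampled token
    # against a teacher/student mixture instead of the raw teacher. Because
    # mix >= (1 - mix_coef) * student, the reward is bounded below at
    # log(1 - mix_coef), tempering large negative penalties from the teacher.
    mix_logp = torch.logaddexp(
        teacher_logp + torch.log(torch.tensor(mix_coef)),
        student_logp + torch.log(torch.tensor(1.0 - mix_coef)),
    )
    idx = tokens.unsqueeze(-1)
    reward = (mix_logp.gather(-1, idx) - student_logp.gather(-1, idx)).squeeze(-1)

    # Entropy-based token-level dynamic sampling (assumed form): keep only the
    # highest-entropy positions, where the student is most uncertain and the
    # teacher signal is most informative; zero out the rest.
    entropy = -(student_logp.exp() * student_logp).sum(-1)
    k = max(1, int(keep_frac * entropy.numel()))
    keep = torch.zeros_like(reward, dtype=torch.bool)
    keep[entropy.topk(k).indices] = True

    return torch.where(keep, reward, torch.zeros_like(reward))
```

Under this reading, the returned per-token rewards would feed a standard policy-gradient update over the student's own sampled trajectories, which is what makes the procedure on-policy.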
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Visual Mathematical Reasoning | MathVista | Accuracy | 72.4 | 278 |
| Mathematical Reasoning | Minerva Math | Accuracy | 38.6 | 209 |
| Visual Mathematical Reasoning | MathVision | Accuracy | 29.21 | 186 |
| Mathematical Reasoning | WeMath | Accuracy | 69.77 | 161 |
| Visual Mathematical Reasoning | MathVerse | Accuracy | 51.43 | 135 |
| Mathematical Reasoning | Olympiad Bench | Accuracy | 57.3 | 123 |
| Mathematical Reasoning | AIME 25 | Mean Accuracy | 32.6 | 26 |
| Mathematical Reasoning | Competition-level Math Benchmarks (AIME24, AIME25, AMC23, MATH500, Olympiad, Minerva) | -- | -- | 21 |
| Visual Reasoning | Geo3K | Accuracy | 53.58 | 10 |
| Visual Perception | Hallusion | Accuracy | 70.14 | 10 |