Scaling Reasoning Efficiently via Relaxed On-Policy Distillation

About

On-policy distillation is pivotal for transferring reasoning capabilities to capacity-constrained models, yet remains prone to instability and negative transfer. We show that on-policy distillation can be interpreted, both theoretically and empirically, as a form of policy optimization, where the teacher-student log-likelihood ratio acts as a token reward. From this insight, we introduce REOPOLD (Relaxed On-Policy Distillation), a framework that stabilizes optimization by relaxing the strict imitation constraints of standard on-policy distillation. Specifically, REOPOLD temperately and selectively leverages rewards from the teacher through mixture-based reward clipping, entropy-based token-level dynamic sampling, and a unified exploration-to-refinement training strategy. Empirically, REOPOLD surpasses its baselines with superior sample efficiency during training and enhanced test-time scaling at inference, across mathematical, visual, and agentic tool-use reasoning tasks. In particular, REOPOLD outperforms recent RL approaches, achieving 6.7x to 12x greater sample efficiency, and enables a 7B student to match a 32B teacher in visual reasoning with a roughly 3.32x inference speedup.

Jongwoo Ko, Sara Abdali, Young Jin Kim, Tianyi Chen, Pashmina Cameron • 2026
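
The abstract's reading of on-policy distillation as policy optimization can be illustrated with a minimal sketch of the token reward alone. It assumes the reward for each student-sampled token is log p_teacher(token) minus log p_student(token), which is one natural reading of "teacher-student log-likelihood ratio". The function names `log_softmax` and `token_rewards` are hypothetical, and the sketch does not attempt REOPOLD's reward clipping or dynamic sampling, whose exact forms the abstract does not give.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one list of logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def token_rewards(teacher_logits, student_logits, sampled_tokens):
    """Per-token reward under the assumed form:
    r_t = log p_teacher(y_t) - log p_student(y_t),
    evaluated on tokens y_t sampled from the student (on-policy).

    teacher_logits / student_logits: one list of vocabulary logits per position.
    sampled_tokens: the token index the student sampled at each position.
    """
    rewards = []
    for t_logits, s_logits, tok in zip(teacher_logits, student_logits, sampled_tokens):
        lp_teacher = log_softmax(t_logits)[tok]
        lp_student = log_softmax(s_logits)[tok]
        rewards.append(lp_teacher - lp_student)
    return rewards


if __name__ == "__main__":
    # Toy 2-token vocabulary, two positions (hypothetical values).
    teacher = [[2.0, 0.0], [0.0, 0.0]]
    student = [[0.0, 0.0], [0.0, 0.0]]
    print(token_rewards(teacher, student, sampled_tokens=[0, 1]))
```

Under this assumed form, a token the teacher finds more likely than the student does receives a positive reward, a token both models score identically receives zero, and averaging the negated reward over student samples gives an estimate of the reverse KL divergence from student to teacher.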

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Visual Mathematical Reasoning | MathVista | Accuracy 72.4 | 366 |
| Visual Mathematical Reasoning | MathVision | Accuracy 29.21 | 254 |
| Mathematical Reasoning | Minerva Math | Accuracy 38.6 | 228 |
| Mathematical Reasoning | WeMath | Accuracy 69.77 | 225 |
| Mathematical Reasoning | Olympiad Bench | Accuracy 57.3 | 222 |
| Mathematical Reasoning | AIME 2024 | Accuracy 36.97 | 220 |
| Mathematical Reasoning | AIME 2025 | Accuracy 30.83 | 214 |
| Visual Mathematical Reasoning | MathVerse | Accuracy 51.43 | 155 |
| Mathematical Reasoning | AIME 25 | -- | 112 |
| Code Generation | LiveCodeBench v6 | Accuracy 19.43 | 75 |
Showing 10 of 40 rows
