Scaling Reasoning Efficiently via Relaxed On-Policy Distillation

About

On-policy distillation is pivotal for transferring reasoning capabilities to capacity-constrained models, yet it remains prone to instability and negative transfer. We show that on-policy distillation can be interpreted, both theoretically and empirically, as a form of policy optimization in which the teacher-student log-likelihood ratio acts as a token-level reward. Building on this insight, we introduce REOPOLD (Relaxed On-Policy Distillation), a framework that stabilizes optimization by relaxing the strict imitation constraints of standard on-policy distillation. Specifically, REOPOLD leverages rewards from the teacher moderately and selectively through mixture-based reward clipping, entropy-based token-level dynamic sampling, and a unified exploration-to-refinement training strategy. Empirically, REOPOLD surpasses its baselines, with superior sample efficiency during training and enhanced test-time scaling at inference, across mathematical, visual, and agentic tool-use reasoning tasks. In particular, REOPOLD outperforms recent RL approaches, achieving 6.7x to 12x greater sample efficiency, and enables a 7B student to match a 32B teacher in visual reasoning with an approximately 3.32x inference speedup.
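To make the reward interpretation concrete, here is a minimal sketch of how a per-token reward could be computed as the teacher-student log-likelihood ratio on tokens sampled by the student. It is an illustration under stated assumptions, not the authors' implementation: the function names (`log_softmax`, `token_rewards`) and the toy logits are hypothetical, and the sketch does not cover reward clipping, dynamic sampling, or the training strategy, whose details the abstract does not give.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def token_rewards(teacher_logits, student_logits, sampled_tokens):
    """Per-token reward: log p_teacher(y_t) - log p_student(y_t).

    Hypothetical helper. Each *_logits argument is a list with one
    logit vector per position; sampled_tokens holds the token ids
    that the student generated (the on-policy sample).
    """
    rewards = []
    for t_logits, s_logits, tok in zip(teacher_logits, student_logits, sampled_tokens):
        teacher_lp = log_softmax(t_logits)[tok]
        student_lp = log_softmax(s_logits)[tok]
        rewards.append(teacher_lp - student_lp)
    return rewards


# Toy example: vocabulary of 3 tokens, sequence of 2 positions.
teacher = [[2.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
student = [[0.0, 0.0, 0.0], [0.0, 3.0, 0.0]]
tokens = [0, 1]
rewards = token_rewards(teacher, student, tokens)
print(rewards)
```

In this toy case the first reward is positive, because the teacher assigns the sampled token more probability than the student does, and the second is negative, because the student is more confident in its sample than the teacher.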

Jongwoo Ko, Sara Abdali, Young Jin Kim, Tianyi Chen, Pashmina Cameron • 2026

Related benchmarks

Task | Dataset | Result | Rank
Visual Mathematical Reasoning | MathVista | Accuracy 72.4 | 278
Mathematical Reasoning | Minerva Math | Accuracy 38.6 | 209
Visual Mathematical Reasoning | MathVision | Accuracy 29.21 | 186
Mathematical Reasoning | WeMath | Accuracy 69.77 | 161
Visual Mathematical Reasoning | MathVerse | Accuracy 51.43 | 135
Mathematical Reasoning | Olympiad Bench | Accuracy 57.3 | 123
Mathematical Reasoning | AIME 25 | Mean Accuracy 32.6 | 26
Mathematical Reasoning | Competition-level Math Benchmarks AIME24, AIME25, AMC23, MATH500, Olympiad, Minerva | -- | 21
Visual Reasoning | Geo3K | Accuracy 53.58 | 10
Visual Perception | Hallusion | Accuracy 70.14 | 10