Scaling Reasoning Efficiently via Relaxed On-Policy Distillation
About
On-policy distillation is pivotal for transferring reasoning capabilities to capacity-constrained models, yet it remains prone to instability and negative transfer. We show that on-policy distillation can be interpreted, both theoretically and empirically, as a form of policy optimization in which the teacher-student log-likelihood ratio acts as a token-level reward. From this insight, we introduce REOPOLD (Relaxed On-Policy Distillation), a framework that stabilizes optimization by relaxing the strict imitation constraints of standard on-policy distillation. Specifically, REOPOLD leverages teacher rewards in a tempered and selective manner through mixture-based reward clipping, entropy-based token-level dynamic sampling, and a unified exploration-to-refinement training strategy. Empirically, REOPOLD surpasses its baselines with superior sample efficiency during training and enhanced test-time scaling at inference across mathematical, visual, and agentic tool-use reasoning tasks. Notably, REOPOLD outperforms recent RL approaches with 6.7-12x greater sample efficiency, and it enables a 7B student to match a 32B teacher in visual reasoning with a ~3.32x inference speedup.
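To make the reward interpretation concrete, below is a minimal PyTorch sketch of the core idea: treating the teacher-student log-likelihood ratio at each sampled token as a per-token reward. The function name `reopold_token_rewards` and the parameters `mix_coef` and `keep_frac` are illustrative assumptions, and the specific forms of the mixture-based clipping and entropy-based selection shown here are guesses at the components named in the abstract, not the paper's exact formulas.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def reopold_token_rewards(
    student_logits,   # (T, V) student logits at each sampled position
    teacher_logits,   # (T, V) teacher logits at the same positions
    tokens,           # (T,) token ids actually sampled by the student
    mix_coef=0.5,     # hypothetical mixture weight for reward clipping
    keep_frac=0.5,    # hypothetical fraction of tokens kept by entropy sampling
):
    """Sketch of on-policy distillation viewed as policy optimization.

    The teacher-student log-likelihood ratio is treated as a token reward,
    per the interpretation stated in the abstract. The clipping and
    sampling rules below are illustrative, not REOPOLD's exact method.
    """
    student_logp = F.log_softmax(student_logits, dim=-1)
    teacher_logp = F.log_softmax(teacher_logits, dim=-1)

    # Mixture-based reward clipping (assumed form): score the sampled token
    # against a teacher/student mixture instead of the raw teacher. Because
    # mix >= (1 - mix_coef) * student, the reward is bounded below at
    # log(1 - mix_coef), tempering large negative penalties from the teacher.
    mix_logp = torch.logaddexp(
        teacher_logp + torch.log(torch.tensor(mix_coef)),
        student_logp + torch.log(torch.tensor(1.0 - mix_coef)),
    )
    idx = tokens.unsqueeze(-1)
    reward = (mix_logp.gather(-1, idx) - student_logp.gather(-1, idx)).squeeze(-1)

    # Entropy-based token-level dynamic sampling (assumed form): keep only the
    # highest-entropy positions, where the student is most uncertain and the
    # teacher signal is most informative; zero out the rest.
    entropy = -(student_logp.exp() * student_logp).sum(-1)
    k = max(1, int(keep_frac * entropy.numel()))
    keep = torch.zeros_like(reward, dtype=torch.bool)
    keep[entropy.topk(k).indices] = True

    return torch.where(keep, reward, torch.zeros_like(reward))
```

Under this reading, the returned per-token rewards would feed a standard policy-gradient update over the student's own sampled trajectories, which is what makes the procedure on-policy.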
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Visual Mathematical Reasoning | MathVista | Accuracy | 72.4 | 278 |
| Mathematical Reasoning | Minerva Math | Accuracy | 38.6 | 209 |
| Visual Mathematical Reasoning | MathVision | Accuracy | 29.21 | 186 |
| Mathematical Reasoning | WeMath | Accuracy | 69.77 | 161 |
| Visual Mathematical Reasoning | MathVerse | Accuracy | 51.43 | 135 |
| Mathematical Reasoning | Olympiad Bench | Accuracy | 57.3 | 123 |
| Mathematical Reasoning | AIME 25 | Mean Accuracy | 32.6 | 26 |
| Mathematical Reasoning | Competition-level Math Benchmarks (AIME24, AIME25, AMC23, MATH500, Olympiad, Minerva) | -- | -- | 21 |
| Visual Reasoning | Geo3K | Accuracy | 53.58 | 10 |
| Visual Perception | Hallusion | Accuracy | 70.14 | 10 |