Asymmetric On-Policy Distillation: Bridging Exploitation and Imitation at the Token Level

About

On-policy distillation (OPD) trains a student on its own trajectories with token-level teacher feedback and often outperforms off-policy distillation and standard reinforcement learning. However, we find that its standard advantage-weighted policy gradient suffers from three structural weaknesses: high-variance updates, vanishing gradients in zero-advantage regions, and exploration bottlenecks when corrective signals are insufficient. We therefore propose Asymmetric On-Policy Distillation (AOPD), which replaces ineffective negative reinforcement with localized divergence minimization in non-positive-advantage regions while preserving positive reinforcement learning. Experiments on mathematical reasoning benchmarks show that AOPD consistently outperforms standard OPD, with average gains of 4.09 and 8.34 under strong and weak initialization, respectively. AOPD also maintains higher policy entropy during training and retains capabilities better during sequential tool-use adaptation.

Nan Jia, Haojin Yang, Xing Ma, Jiesong Lian, Shuailiang Zhang, Weipeng Zhang, Ke Zeng, Xunliang Cai, Zequn Sun • 2026
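
The asymmetric rule described in the abstract can be illustrated with a minimal per-token sketch. This is not the authors' implementation: the abstract does not specify the advantage or the divergence, so the code assumes the advantage is the teacher-minus-student log-probability of the sampled token and assumes KL(teacher || student) over the next-token distribution as the local divergence. The function names `kl_divergence` and `asymmetric_token_loss` are hypothetical, and the code only evaluates the scalar objective for one token position.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) for discrete distributions given as lists of probabilities.

    Assumes q[i] > 0 wherever p[i] > 0.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def asymmetric_token_loss(student_probs, teacher_probs, sampled_token):
    """Return (branch, loss) for one token position.

    ASSUMPTION: advantage = log p_teacher(token) - log p_student(token).
    ASSUMPTION: the local divergence is KL(teacher || student).
    """
    student_logp = math.log(student_probs[sampled_token])
    teacher_logp = math.log(teacher_probs[sampled_token])
    advantage = teacher_logp - student_logp

    if advantage > 0.0:
        # Positive region: keep the advantage-weighted policy-gradient term
        # (the advantage is treated as a constant weight).
        return "reinforce", -advantage * student_logp

    # Non-positive region (including zero advantage): instead of a negative
    # reinforcement term, match the teacher's full next-token distribution.
    return "divergence", kl_divergence(teacher_probs, student_probs)


if __name__ == "__main__":
    student = [0.7, 0.2, 0.1]  # toy next-token distribution of the student
    teacher = [0.5, 0.4, 0.1]  # toy next-token distribution of the teacher
    for token in range(3):
        print(token, asymmetric_token_loss(student, teacher, token))
```

In this toy setup, token 2 has exactly zero advantage, where a purely advantage-weighted term would vanish, yet the divergence branch still yields a nonzero loss because the two distributions differ on the other tokens.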

Related benchmarks

Task	Dataset	Metric	Value
Tool Use	ToolAlpaca	Tool Use Success Rate	75	26
Mathematical Reasoning	HMMT 2025 (Feb)	Pass@1	38.33	21
Mathematical Reasoning	AIME 2024	Pass@1	66.87	18
Mathematical Reasoning	AIME 2025	Pass@1	55	18
Mathematical Reasoning	Mathematical Reasoning Average	Pass@1	53.4	18
Mathematical Reasoning	Math reasoning tasks	Math Avg. Pass@4 (Before)	67.43	4

