
Asymmetric On-Policy Distillation: Bridging Exploitation and Imitation at the Token Level

About

On-policy distillation (OPD) trains a student on its own trajectories with token-level teacher feedback and often outperforms off-policy distillation and standard reinforcement learning. However, we find that its standard advantage-weighted policy gradient suffers from three structural weaknesses: high-variance updates, vanishing gradients in zero-advantage regions, and exploration bottlenecks when corrective signals are insufficient. We therefore propose Asymmetric On-Policy Distillation (AOPD), which replaces ineffective negative reinforcement with localized divergence minimization in non-positive advantage regions while preserving positive reinforcement learning. Experiments on mathematical reasoning benchmarks show that AOPD consistently outperforms standard OPD, with average gains of 4.09 and 8.34 under strong and weak initialization, respectively. AOPD also maintains higher policy entropy during training and retains capabilities better during sequential tool-use adaptation.
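The abstract does not state the objective itself, so the following is only a minimal sketch of the asymmetric idea under stated assumptions, not the authors' implementation. It assumes the token-level advantage is the teacher log-probability minus the student log-probability of the sampled token, and that the "localized divergence" is the reverse KL between student and teacher over the vocabulary at that position. The function names (`log_softmax`, `asymmetric_token_loss`, `sequence_loss`) are hypothetical.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def asymmetric_token_loss(student_logits, teacher_logits, token_id):
    """Return (branch, loss) for one sampled token.

    Assumption: advantage = log p_teacher(token) - log p_student(token).
    Assumption: the divergence is reverse KL(student || teacher).
    """
    s_logp = log_softmax(student_logits)
    t_logp = log_softmax(teacher_logits)
    advantage = t_logp[token_id] - s_logp[token_id]

    if advantage > 0:
        # Positive region: keep the advantage-weighted policy-gradient term.
        # In an autodiff framework the advantage would be treated as a constant.
        return "reinforce", -advantage * s_logp[token_id]

    # Non-positive region: minimize a divergence at this position
    # instead of applying negative reinforcement to the sampled token.
    kl = sum(math.exp(s) * (s - t) for s, t in zip(s_logp, t_logp))
    return "divergence", kl


def sequence_loss(student_steps, teacher_steps, token_ids):
    """Mean per-token loss over one sampled trajectory."""
    losses = [
        asymmetric_token_loss(s, t, tok)[1]
        for s, t, tok in zip(student_steps, teacher_steps, token_ids)
    ]
    return sum(losses) / len(losses)
```

In this sketch, a token the teacher prefers more than the student does is reinforced, while a token with zero or negative advantage contributes a distribution-matching term. That term still yields a learning signal wherever the two distributions differ, which is one way to read the abstract's remark about vanishing gradients in zero-advantage regions.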

Nan Jia, Haojin Yang, Xing Ma, Jiesong Lian, Shuailiang Zhang, Weipeng Zhang, Ke Zeng, Xunliang Cai, Zequn Sun • 2026

Related benchmarks

Task | Dataset | Metric | Result | Rank
Tool Use | ToolAlpaca | Tool Use Success Rate | 75 | 26
Mathematical Reasoning | HMMT 2025 (Feb) | Pass@1 | 38.33 | 21
Mathematical Reasoning | AIME 2024 | Pass@1 | 66.87 | 18
Mathematical Reasoning | AIME 2025 | Pass@1 | 55 | 18
Mathematical Reasoning | Mathematical Reasoning Average | Pass@1 | 53.4 | 18
Mathematical Reasoning | Math reasoning tasks | Math Avg. Pass@4 (Before) | 67.43 | 4
