
Teacher-Guided Policy Optimization for On-Policy Reasoning Distillation under Large Policy Divergence

About

On-policy distillation (OPD) has become a promising paradigm for reasoning-oriented post-training of large language models (LLMs), especially when combined with reinforcement learning from verifiable rewards (RLVR). Existing OPD methods rely on reverse KL (RKL)-based teacher supervision over trajectories sampled from the student policy. However, we identify a critical limitation: under large teacher-student policy divergence, RL-driven exploration often produces trajectories outside the teacher distribution, resulting in uninformative negative feedback. To address this, we propose Teacher-Guided Policy Optimization (TGPO), an on-policy reasoning distillation method that remains effective under large policy divergence. Rather than relying solely on evaluative supervision, TGPO uses the teacher to directly guide token-level generation conditioned on student-generated contexts; together with RLVR-style trajectory-level rewards, TGPO steers exploration toward improved continuations. Experiments on reasoning benchmarks show that TGPO consistently outperforms existing RKL-based OPD methods and remains robust across different teacher models.

Xinyu Liu, Kechen Jiao, Chunyang Xiao, Runsong Zhao, Junhao Ruan, Bei Li, Jiahao Liu, Qifan Wang, Xin Chen, Jingang Wang, Chenglong Wang, Tong Xiao, JingBo Zhu • 2026
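
The reverse KL supervision that the abstract attributes to existing OPD methods can be illustrated with a small sketch. This is not the TGPO algorithm and not the authors' code; it only computes a per-token KL(student || teacher) on contexts the student sampled, using toy logits. The function names (`log_softmax`, `reverse_kl`, `sequence_reverse_kl`) and all numeric values are assumptions made for illustration.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def reverse_kl(student_logits, teacher_logits):
    """KL(student || teacher) at one token position.

    The expectation is taken under the student distribution, which is
    what makes this the "reverse" direction in distillation.
    """
    log_s = log_softmax(student_logits)
    log_t = log_softmax(teacher_logits)
    return sum(math.exp(a) * (a - b) for a, b in zip(log_s, log_t))


def sequence_reverse_kl(student_steps, teacher_steps):
    """Mean per-token reverse KL over one student-sampled trajectory.

    Both arguments are lists of per-position logit vectors, where the
    teacher is scored on the same student-generated context.
    """
    kls = [reverse_kl(s, t) for s, t in zip(student_steps, teacher_steps)]
    return sum(kls) / len(kls)


if __name__ == "__main__":
    # Toy 3-token vocabulary (hypothetical values).
    student = [5.0, 0.0, 0.0]          # student strongly prefers token 0
    teacher_close = [4.0, 0.0, 0.0]    # teacher agrees on token 0
    teacher_far = [0.0, 5.0, 0.0]      # teacher prefers a different token

    print("small divergence:", reverse_kl(student, teacher_close))
    print("large divergence:", reverse_kl(student, teacher_far))
```

In this toy setting the penalty grows sharply when the student concentrates mass on a token the teacher considers unlikely. The signal says the student's choice is bad but does not by itself indicate which continuation would be better, which is the kind of uninformative negative feedback the abstract describes under large policy divergence.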

Related benchmarks

Task                            Dataset        Result                 Rank
Mathematical Reasoning          Minerva        Pass@1 Accuracy: 40.4  289
Mathematical Reasoning          MATH 500       Pass@1 Rate: 84.4      236
Mathematical Reasoning          AMC            Average Pass@32: 60.2  44
Question Answering              ARC Challenge  Pass@1: 82.8           18
Multitask Knowledge Evaluation  MMLU-Pro       Pass@1: 50.1           14
Scientific Question Answering   GPQA Diamond   Pass@1: 37.9           14
