
Trust Region On-Policy Distillation

About

On-Policy Distillation (OPD) is a fundamental technique for efficient post-training of large language models (LLMs), with broad applications in agent learning, multi-task enhancement, and model compression. However, OPD training becomes unstable when the teacher and student distributions differ substantially, because teacher supervision on student-generated tokens may yield unreliable policy gradients and can even cause optimization failure. This work addresses reliable on-policy token-level supervision through credit assignment strategies and proposes Trust Region On-Policy Distillation (TrOPD). TrOPD has three components: 1) Trust-Region On-Policy Learning: TrOPD performs OPD only in regions where the teacher provides reliable supervision, mitigating the optimization difficulty of the K1 reverse-KL estimator under distribution mismatch. 2) Outlier Estimation: For outlier regions, we explore gradient clipping, masking, and forward-KL estimation to reduce the adverse effects of unreliable supervision. 3) Off-Policy Guidance: The student continues generation from teacher prefixes and uses forward KL to imitate off-policy guidance, encouraging on-policy exploration toward reliable regions. Experiments show that TrOPD consistently outperforms state-of-the-art OPD baselines, including OPD, EOPD, and REOPOLD, across mathematical reasoning, code generation, and general-domain benchmarks.
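To make the token-level quantities in the abstract concrete, here is a minimal sketch in plain Python. It is not the authors' implementation. The K1 estimate (student log-probability minus teacher log-probability on a student-sampled token) and the forward KL are standard definitions. Everything else is an assumption for illustration: the function names, the `threshold` parameter, the rule that a token is "reliable" when the absolute K1 value is at most `threshold`, and the use of value clipping as a simple stand-in for the gradient clipping the abstract mentions.

```python
import math


def k1_reverse_kl(student_logp, teacher_logp):
    """Per-token K1 estimate of the reverse KL on student-sampled tokens.

    For a token x sampled from the student:
        K1(x) = log p_student(x) - log p_teacher(x)
    """
    return [s - t for s, t in zip(student_logp, teacher_logp)]


def forward_kl(teacher_probs, student_probs, eps=1e-12):
    """Forward KL(teacher || student) over one token's vocabulary distribution."""
    return sum(
        p * (math.log(p + eps) - math.log(q + eps))
        for p, q in zip(teacher_probs, student_probs)
        if p > 0.0
    )


def trust_region_signal(student_logp, teacher_logp, threshold=2.0, outlier="mask"):
    """Average the K1 signal, handling tokens outside a trust region.

    ASSUMPTION: a token is treated as reliable when |K1| <= threshold.
    This criterion is hypothetical; the source does not define the region.
    """
    k1 = k1_reverse_kl(student_logp, teacher_logp)
    if outlier == "mask":
        # Drop outlier tokens entirely and average over the rest.
        kept = [v for v in k1 if abs(v) <= threshold]
        return sum(kept) / len(kept) if kept else 0.0
    if outlier == "clip":
        # Bound each token's contribution (a stand-in for gradient clipping).
        clipped = [max(-threshold, min(threshold, v)) for v in k1]
        return sum(clipped) / len(clipped)
    raise ValueError("outlier must be 'mask' or 'clip'")


# Toy log-probabilities for three student-sampled tokens; the third token
# is one the teacher finds very unlikely, so its K1 value is large.
student_logp = [-0.5, -1.0, -0.2]
teacher_logp = [-0.7, -1.2, -6.0]

masked = trust_region_signal(student_logp, teacher_logp, outlier="mask")
clipped = trust_region_signal(student_logp, teacher_logp, outlier="clip")
```

On these toy inputs the per-token K1 values are roughly 0.2, 0.2, and 5.8. Masking averages only the first two (about 0.2), while clipping caps the third at 2.0 (average about 0.8). For the third option in the abstract, `forward_kl` could be evaluated on the full teacher and student distributions at the outlier position instead of using the sampled-token estimate.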

Xingrun Xing, Haoqing Wang, Boyan Gao, Ziheng Li, Yehui Tang • 2026

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Mathematical Reasoning | AIME 2024 | Accuracy 38.54 | 220 |
| Mathematical Reasoning | AIME 2025 | Accuracy 32.5 | 214 |
| Code Generation | LiveCodeBench v6 | Accuracy 22.29 | 75 |
| Mathematical Reasoning | AMC 23 | Accuracy 77.03 | 69 |
| Scientific Reasoning | GPQA Diamond | Accuracy 36.24 | 62 |
| Instruction Following | IFBench | IFBench Score 42.18 | 56 |
| Code Generation | LiveCodeBench v6 | LCB.v6 Score 36 | 7 |
| General Multi-domain Reasoning | Multi-domain aggregate | Average Score 51.73 | 7 |
| Mathematical Reasoning | AIME 2024, AIME 2025, AMC 2023 | AIME 2024 Score 52.08 | 7 |
| STEM Knowledge and Reasoning | GPQA Diamond MMLU Reduced | GPQA Diamond Accuracy 35.98 | 7 |