Trust Region On-Policy Distillation

About

On-Policy Distillation (OPD) is a fundamental technique for efficient post-training of large language models (LLMs), with broad applications in agent learning, multi-task enhancement, and model compression. However, OPD training becomes unstable when the teacher and student distributions differ substantially: teacher supervision on student-generated tokens may yield unreliable policy gradients and can even cause optimization to fail. This work addresses reliable on-policy token-level supervision through credit assignment strategies and proposes Trust Region On-Policy Distillation (TrOPD). It has three components:

1) Trust-Region On-Policy Learning: TrOPD performs OPD only in regions where the teacher provides reliable supervision, mitigating the optimization difficulty of the K1 reverse-KL estimator under distribution mismatch.
2) Outlier Estimation: For outlier regions, the authors explore gradient clipping, masking, and forward-KL estimation to reduce the adverse effects of unreliable supervision.
3) Off-Policy Guidance: The student continues generation from teacher prefixes and uses forward KL to imitate off-policy guidance, encouraging on-policy exploration toward reliable regions.

Experiments show that TrOPD consistently outperforms state-of-the-art OPD baselines, including OPD, EOPD, and REOPOLD, across mathematical reasoning, code generation, and general-domain benchmarks.
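To make the trust-region idea concrete, here is a minimal Python sketch of per-token K1 reverse-KL supervision with masking or clipping of outlier tokens, plus an exact forward KL over a full next-token distribution. The abstract does not state how TrOPD defines a reliable region, so the criterion used here (`abs(k1) <= delta`), the threshold `delta`, and all function names are assumptions made for illustration, not the authors' implementation.

```python
import math


def k1_reverse_kl(student_logp, teacher_logp):
    """K1 estimate of KL(student || teacher) for one student-sampled token."""
    return student_logp - teacher_logp


def forward_kl(teacher_probs, student_probs):
    """Exact KL(teacher || student) over one full next-token distribution."""
    return sum(
        p * (math.log(p) - math.log(q))
        for p, q in zip(teacher_probs, student_probs)
        if p > 0.0
    )


def trust_region_signal(student_logps, teacher_logps, delta=2.0, outlier="mask"):
    """Per-token K1 signal, kept as-is only inside a hypothetical trust region.

    Assumption: a token is 'reliable' when |k1| <= delta. Outside that region
    the signal is either zeroed ("mask") or clamped to [-delta, delta] ("clip").
    """
    signals = []
    for s, t in zip(student_logps, teacher_logps):
        k1 = k1_reverse_kl(s, t)
        if abs(k1) <= delta:
            signals.append(k1)
        elif outlier == "mask":
            signals.append(0.0)
        elif outlier == "clip":
            signals.append(max(-delta, min(delta, k1)))
        else:
            raise ValueError(f"unknown outlier mode: {outlier}")
    return signals


# Log-probabilities of three student-sampled tokens under each model.
student_logps = [-0.5, -1.0, -0.2]
teacher_logps = [-0.7, -6.0, -0.1]  # the teacher finds the second token very unlikely

masked = trust_region_signal(student_logps, teacher_logps, outlier="mask")
clipped = trust_region_signal(student_logps, teacher_logps, outlier="clip")
```

In this toy input the second token has a large log-ratio, so it is the only one treated as an outlier. A forward-KL variant would replace the masked value with `forward_kl` computed on the full teacher and student distributions at that position.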

Xingrun Xing, Haoqing Wang, Boyan Gao, Ziheng Li, Yehui Tang • 2026

Related benchmarks

Task	Dataset	Metric	Value	
Mathematical Reasoning	AIME 2024	Accuracy	38.54	394
Mathematical Reasoning	AIME 2025	Accuracy	32.5	378
Code Generation	LiveCodeBench v6	Accuracy	22.29	91
Mathematical Reasoning	AMC 23	Accuracy	77.03	83
Scientific Reasoning	GPQA Diamond	Accuracy	36.24	73
Instruction Following	IFBench	IFBench Score	42.18	68
Mathematical Reasoning	DAPO-Math OOD average (avg of four benchmarks)	Avg@16	54.8	30
Mathematical Reasoning	DAPO-Math In-Distribution (test)	Avg@16	81.6	30
Code Generation	LiveCodeBench v6	LCB.v6 Score	36	7
General Multi-domain Reasoning	Multi-domain aggregate	Average Score	51.73	7

Showing 10 of 12 rows
