Trust Region On-Policy Distillation
About
On-Policy Distillation (OPD) is a fundamental technique for efficient post-training of large language models (LLMs), with broad applications in agent learning, multi-task enhancement, and model compression. However, OPD training becomes unstable when the teacher and student distributions differ substantially: teacher supervision on student-generated tokens may yield unreliable policy gradients and can even cause optimization to fail. This work addresses reliable on-policy token-level supervision through credit assignment strategies and proposes Trust Region On-Policy Distillation (TrOPD). It has three components: 1) Trust-Region On-Policy Learning: TrOPD performs OPD only in regions where the teacher provides reliable supervision, mitigating the optimization difficulty of the K1 reverse-KL estimator under distribution mismatch. 2) Outlier Estimation: For outlier regions, we explore gradient clipping, masking, and forward-KL estimation to reduce the adverse effects of unreliable supervision. 3) Off-Policy Guidance: The student continues generation from teacher prefixes and uses forward KL to imitate off-policy guidance, encouraging on-policy exploration toward reliable regions. Experiments show that TrOPD consistently outperforms state-of-the-art OPD baselines, including OPD, EOPD, and REOPOLD, across mathematical reasoning, code generation, and general-domain benchmarks.
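The trust-region and outlier-handling ideas can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not say how a "reliable region" is detected, so the sketch assumes a hypothetical rule (a token is trusted when the absolute K1 log-ratio is at most `threshold`), and the names `tropd_token_signals`, `threshold`, and `outlier_mode` are invented for illustration. It shows only two of the outlier options mentioned above, masking and forward-KL estimation, and it computes per-token scalar signals rather than gradients.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary row."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - lse for v in logits]


def tropd_token_signals(student_logits, teacher_logits, sampled_ids,
                        threshold=2.0, outlier_mode="mask"):
    """Per-token distillation signals for one student-sampled sequence.

    ASSUMPTIONS (not from the source): the trust region is defined by
    |K1| <= threshold, and `outlier_mode` selects "mask" or "forward_kl".
    """
    signals, trusted = [], []
    for s_row, t_row, tok in zip(student_logits, teacher_logits, sampled_ids):
        s_lp = log_softmax(s_row)
        t_lp = log_softmax(t_row)
        # K1 estimator of reverse KL at the token the student sampled:
        # log pi_student(tok) - log pi_teacher(tok).
        k1 = s_lp[tok] - t_lp[tok]
        in_region = abs(k1) <= threshold  # hypothetical reliability test
        trusted.append(in_region)
        if in_region:
            signals.append(k1)
        elif outlier_mode == "mask":
            # Drop the unreliable token from the objective.
            signals.append(0.0)
        elif outlier_mode == "forward_kl":
            # Full-vocabulary forward KL: sum_v p_T(v) * (log p_T(v) - log p_S(v)).
            fkl = sum(math.exp(t) * (t - s) for t, s in zip(t_lp, s_lp))
            signals.append(fkl)
        else:
            raise ValueError(f"unknown outlier_mode: {outlier_mode}")
    return signals, trusted


# Toy example: two positions, vocabulary of size 2.
student = [[0.0, 0.0], [0.0, 0.0]]
teacher = [[0.0, 0.0], [10.0, 0.0]]   # teacher strongly disagrees at position 1
sampled = [0, 1]                      # tokens the student generated
masked, trusted_flags = tropd_token_signals(student, teacher, sampled)
fkl_signals, _ = tropd_token_signals(student, teacher, sampled,
                                     outlier_mode="forward_kl")
```

In the toy example the second token falls outside the assumed trust region because the teacher assigns it very low probability, so it is either masked out or replaced by a forward-KL term over the full vocabulary. In an actual training loop these signals would feed a differentiable loss; that part is omitted here.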
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Mathematical Reasoning | AIME 2024 | Accuracy | 38.54 | 220 |
| Mathematical Reasoning | AIME 2025 | Accuracy | 32.5 | 214 |
| Code Generation | LiveCodeBench v6 | Accuracy | 22.29 | 75 |
| Mathematical Reasoning | AMC 23 | Accuracy | 77.03 | 69 |
| Scientific Reasoning | GPQA Diamond | Accuracy | 36.24 | 62 |
| Instruction Following | IFBench | IFBench Score | 42.18 | 56 |
| Code Generation | LiveCodeBench v6 | LCB.v6 Score | 36 | 7 |
| General Multi-domain Reasoning | Multi-domain aggregate | Average Score | 51.73 | 7 |
| Mathematical Reasoning | AIME 2024, AIME 2025, AMC 2023 | AIME 2024 Score | 52.08 | 7 |
| STEM Knowledge and Reasoning | GPQA Diamond MMLU Reduced | GPQA Diamond Accuracy | 35.98 | 7 |