Uni-OPD: Unifying On-Policy Distillation with a Dual-Perspective Recipe
About
On-policy distillation (OPD) has recently emerged as an effective post-training paradigm for consolidating the capabilities of specialized expert models into a single student model. Despite its empirical success, the conditions under which OPD yields reliable improvement remain poorly understood. In this work, we identify two fundamental bottlenecks that limit effective OPD: insufficient exploration of informative states and unreliable teacher supervision for student rollouts. Building on this insight, we propose Uni-OPD, a unified OPD framework that generalizes across Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs), centered on a dual-perspective optimization strategy. Specifically, from the student's perspective, we adopt two data balancing strategies to promote exploration of informative student-generated states during training. From the teacher's perspective, we show that reliable supervision hinges on whether aggregated token-level guidance remains order-consistent with the outcome reward. To this end, we develop an outcome-guided margin calibration mechanism to restore order consistency between correct and incorrect trajectories. We conduct extensive experiments on 5 domains and 16 benchmarks covering diverse settings, including single-teacher and multi-teacher distillation across LLMs and MLLMs, strong-to-weak distillation, and cross-modal distillation. Our results verify the effectiveness and versatility of Uni-OPD and provide practical insights into reliable OPD.
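The order-consistency condition mentioned in the abstract can be made concrete with a small sketch. The abstract does not specify how token-level guidance is aggregated or how the margin calibration is computed, so everything below is an illustrative assumption rather than the authors' method: aggregation is taken to be a plain sum of per-token signals, order consistency is taken to mean that every correct trajectory scores above every incorrect one by at least a margin, and the hypothetical `calibrate` helper restores that margin by shifting the token signals of incorrect trajectories downward.

```python
from typing import List, Sequence


def aggregate(token_signals: Sequence[float]) -> float:
    """Sum per-token guidance into one trajectory score (assumed aggregation)."""
    return sum(token_signals)


def is_order_consistent(
    scores: Sequence[float], correct: Sequence[bool], margin: float = 0.0
) -> bool:
    """True if every correct trajectory outscores every incorrect one by >= margin."""
    pos = [s for s, c in zip(scores, correct) if c]
    neg = [s for s, c in zip(scores, correct) if not c]
    if not pos or not neg:
        return True  # nothing to compare against
    return min(pos) - max(neg) >= margin


def calibrate(
    token_signals: Sequence[Sequence[float]],
    correct: Sequence[bool],
    margin: float = 1.0,
) -> List[List[float]]:
    """Hypothetical calibration: lower incorrect trajectories until the margin holds.

    The missing margin is spread evenly over the tokens of each incorrect
    trajectory. Empty trajectories are left unchanged.
    """
    scores = [aggregate(t) for t in token_signals]
    pos = [s for s, c in zip(scores, correct) if c]
    neg = [s for s, c in zip(scores, correct) if not c]
    if not pos or not neg:
        return [list(t) for t in token_signals]
    deficit = margin - (min(pos) - max(neg))
    if deficit <= 0:
        return [list(t) for t in token_signals]
    calibrated = []
    for toks, is_correct in zip(token_signals, correct):
        if is_correct or len(toks) == 0:
            calibrated.append(list(toks))
        else:
            shift = deficit / len(toks)
            calibrated.append([t - shift for t in toks])
    return calibrated


# Toy rollouts: the incorrect trajectory receives the higher aggregated signal.
signals = [[0.5, 0.5], [1.0, 1.0]]
outcomes = [True, False]
raw_scores = [aggregate(t) for t in signals]
fixed = calibrate(signals, outcomes, margin=1.0)
fixed_scores = [aggregate(t) for t in fixed]
```

In this toy case the raw scores rank the incorrect rollout above the correct one, which is the failure mode the abstract describes as unreliable teacher supervision. After the assumed calibration, the ordering agrees with the outcome reward.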
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Code Generation | HumanEval+ | -- | -- | 393 |
| Multimodal Logical Reasoning | LogicVista | Accuracy | 42 | 63 |
| Math Reasoning | AIME 2025 | Accuracy | 57.2 | 60 |
| Mathematical Reasoning | HMMT 25 | Accuracy (HMMT 25) | 39.8 | 50 |
| Code Generation | MBPP+ | Pass@1 | 72.6 | 40 |
| Math Reasoning | AIME 2024 | Accuracy | 63.3 | 39 |
| Code Generation | HumanEval+ | Pass@1 | 88.3 | 34 |
| Math Reasoning | HMMT25 | Accuracy (HMMT25) | 34.9 | 21 |
| Multimodal Mathematical Reasoning | WeMath mini (test) | Accuracy | 65 | 18 |
| Multimodal Reasoning | VisuLogic | Pass@1 | 27.6 | 17 |