
Uni-OPD: Unifying On-Policy Distillation with a Dual-Perspective Recipe

About

On-policy distillation (OPD) has recently emerged as an effective post-training paradigm for consolidating the capabilities of specialized expert models into a single student model. Despite its empirical success, the conditions under which OPD yields reliable improvement remain poorly understood. In this work, we identify two fundamental bottlenecks that limit effective OPD: insufficient exploration of informative states and unreliable teacher supervision for student rollouts. Building on this insight, we propose Uni-OPD, a unified OPD framework that generalizes across Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs), centered on a dual-perspective optimization strategy. Specifically, from the student's perspective, we adopt two data balancing strategies to promote exploration of informative student-generated states during training. From the teacher's perspective, we show that reliable supervision hinges on whether aggregated token-level guidance remains order-consistent with the outcome reward. To this end, we develop an outcome-guided margin calibration mechanism to restore order consistency between correct and incorrect trajectories. We conduct extensive experiments on 5 domains and 16 benchmarks covering diverse settings, including single-teacher and multi-teacher distillation across LLMs and MLLMs, strong-to-weak distillation, and cross-modal distillation. Our results verify the effectiveness and versatility of Uni-OPD and provide practical insights into reliable OPD.
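The order-consistency condition described above can be illustrated with a small sketch. This is not the paper's formulation: the abstract gives no equations, so the mean aggregation, the uniform shift, the `margin` value, and all function names (`aggregate`, `is_order_consistent`, `calibrate`) are assumptions chosen only for illustration. The sketch treats a group of rollouts as order-consistent when every correct trajectory has a higher aggregated token-level score than every incorrect one, and restores that ordering by shifting the correct trajectories when it is violated.

```python
from typing import List, Sequence


def aggregate(token_scores: Sequence[float]) -> float:
    """Aggregate token-level guidance for one trajectory (assumed here: the mean)."""
    return sum(token_scores) / len(token_scores)


def is_order_consistent(agg: Sequence[float], correct: Sequence[bool]) -> bool:
    """True if every correct trajectory scores strictly above every incorrect one."""
    pos = [a for a, c in zip(agg, correct) if c]
    neg = [a for a, c in zip(agg, correct) if not c]
    if not pos or not neg:
        # With only one outcome class there is no ordering to violate.
        return True
    return min(pos) > max(neg)


def calibrate(agg: Sequence[float], correct: Sequence[bool],
              margin: float = 0.1) -> List[float]:
    """Shift correct trajectories up until they lead incorrect ones by `margin`.

    Hypothetical calibration rule: a uniform shift applied only when the
    current gap between the two groups is smaller than `margin`.
    """
    pos = [a for a, c in zip(agg, correct) if c]
    neg = [a for a, c in zip(agg, correct) if not c]
    if not pos or not neg:
        return list(agg)
    gap = min(pos) - max(neg)
    if gap >= margin:
        return list(agg)
    shift = margin - gap
    return [a + shift if c else a for a, c in zip(agg, correct)]


if __name__ == "__main__":
    # Three rollouts with token-level scores and binary outcome rewards.
    rollouts = [[0.1, 0.3], [0.6, 0.4], [0.5, 0.3]]
    outcomes = [True, False, True]
    scores = [aggregate(r) for r in rollouts]
    print(is_order_consistent(scores, outcomes))
    print(is_order_consistent(calibrate(scores, outcomes), outcomes))
```

In this toy group, the incorrect rollout initially outscores a correct one, so the ordering check fails before calibration and passes after it. A real implementation would apply such a correction to the training signal rather than to summary scores.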

Wenjin Hou, Shangpin Peng, Weinong Wang, Zheng Ruan, Yue Zhang, Zhenglin Zhou, Mingqi Gao, Yifei Chen, Kaiqi Wang, Hongming Yang, Chengquan Zhang, Zhuotao Tian, Han Hu, Yi Yang, Fei Wu, Hehe Fan • 2026

Related benchmarks

Task                                Dataset              Result                     Rank
Code Generation                     HumanEval+           --                         393
Multimodal Logical Reasoning        LogicVista           Accuracy 42                63
Math Reasoning                      AIME 2025            Accuracy 57.2              60
Mathematical Reasoning              HMMT 25              Accuracy (HMMT 25) 39.8    50
Code Generation                     MBPP+                Pass@1 72.6                40
Math Reasoning                      AIME 2024            Accuracy 63.3              39
Code Generation                     HumanEval+           Pass@1 88.3                34
Math Reasoning                      HMMT25               Accuracy (HMMT25) 34.9     21
Multimodal Mathematical Reasoning   WeMath mini (test)   Accuracy 65                18
Multimodal Reasoning                VisuLogic            Pass@1 27.6                17
Showing 10 of 26 rows
