Co-Evolving Policy Distillation

About

Reinforcement learning with verifiable rewards (RLVR) and on-policy distillation (OPD) have become standard paradigms for post-training. We provide a unified analysis of how the two paradigms consolidate multiple expert capabilities into a single model, and find that each loses capability in a different way: mixed RLVR incurs an inter-capability divergence cost, while the pipeline of first training experts and then performing OPD avoids divergence but fails to fully absorb teacher capabilities because of large gaps between the teacher's and the student's behavioral patterns. We propose Co-Evolving Policy Distillation (CoPD), which trains experts in parallel and applies OPD during each expert's ongoing RLVR training rather than after expert training is complete, with experts serving as mutual teachers (making OPD bidirectional) so that they co-evolve. This keeps behavioral patterns more consistent among experts while preserving sufficient complementary knowledge throughout training. Experiments show that CoPD achieves all-in-one integration of text, image, and video reasoning capabilities, significantly outperforming strong baselines such as mixed RLVR and MOPD, and even surpassing domain-specific experts. The parallel model training pattern offered by CoPD may inspire a novel training scaling paradigm.
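The co-evolving idea in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: it assumes two hypothetical experts ("text" and "image"), each holding a small softmax policy per domain, a verifiable reward of 1 for a single correct action, and exact gradients in place of sampled rollouts. Each expert runs an RLVR step on its own domain and, in the same step, an OPD step (reverse KL under its own distribution) toward its peer on the peer's domain. All names, domains, and hyperparameters below are illustrative assumptions.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def rlvr_grad(logits, correct):
    """Exact gradient of the expected reward, where reward = 1 iff action == correct."""
    p = softmax(logits)
    return [p[correct] * ((1.0 if k == correct else 0.0) - p[k])
            for k in range(len(p))]


def opd_grad(student_logits, teacher_logits):
    """Exact gradient of reverse KL(student || teacher) w.r.t. the student logits."""
    ps, pt = softmax(student_logits), softmax(teacher_logits)
    log_ratio = [math.log(a) - math.log(b) for a, b in zip(ps, pt)]
    kl = sum(a * r for a, r in zip(ps, log_ratio))
    return [ps[k] * (log_ratio[k] - kl) for k in range(len(ps))]


def co_evolve(num_actions=4, steps=200, lr=1.0):
    """Toy co-evolution: RLVR on the own domain, OPD from the peer on the other."""
    # Hypothetical domains and their correct actions (assumptions for this toy).
    correct = {"text": 1, "image": 2}
    # experts[name][domain] holds that expert's logits on that domain.
    experts = {name: {d: [0.0] * num_actions for d in correct} for name in correct}

    for _ in range(steps):
        # Compute every update from the current snapshot so the experts
        # step in parallel and teach each other while both are still training.
        updates = {}
        for own in correct:
            for d in correct:
                if d == own:
                    g = rlvr_grad(experts[own][d], correct[d])
                    updates[(own, d)] = [lr * gk for gk in g]      # ascend reward
                else:
                    g = opd_grad(experts[own][d], experts[d][d])   # peer is teacher
                    updates[(own, d)] = [-lr * gk for gk in g]     # descend KL
        for (own, d), delta in updates.items():
            experts[own][d] = [z + dz for z, dz in zip(experts[own][d], delta)]

    # Probability each expert assigns to the correct action on each domain.
    return {(own, d): softmax(experts[own][d])[correct[d]]
            for own in correct for d in correct}


if __name__ == "__main__":
    for (expert, domain), prob in co_evolve().items():
        print(f"expert={expert:5s} domain={domain:5s} p(correct)={prob:.3f}")
```

In this toy, the student on each domain always distills from a teacher that is only slightly ahead of it, which is one way to read the abstract's point about keeping behavioral gaps small; whether this matches the paper's actual loss or schedule is not confirmed by the page.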

Naibin Gu, Chenxu Yang, Qingyi Si, Chuanyu Qin, Dingyu Yao, Peng Fu, Zheng Lin, Weiping Wang, Nan Duan, Jiaqi Wang • 2026

Related benchmarks

Task                           Dataset        Result                     Rank
Visual Mathematical Reasoning  MathVista      Accuracy 75.75             366
Video Reasoning                Video-Holmes   Accuracy 43.77             83
Video Reasoning                VideoMathQA    Accuracy 55.76             61
Visual Reasoning               MathVerse      Accuracy 59.34             40
Video Reasoning                MVBench        --                         39
Image Reasoning                WeMath         Accuracy 59.9              34
Reasoning                      AIME 2025      AIME 2025 Accuracy 49.58   19
Image Reasoning                MathVista      Accuracy 76.8              17
Image Reasoning                MMMU           Accuracy 66.94             17
Image Reasoning                MMMU-Pro       Accuracy 55.12             14
Showing 10 of 18 rows
