Counteraction-Aware Multi-Teacher On-Policy Distillation for General Capability Recovery with Domain Preservation

About

Domain specialization can improve LLM behavior in vertical domains, but often weakens the general capabilities inherited from the original model. Recent Multi-Teacher On-Policy Distillation (MOPD) pipelines recover model capabilities by supervising student-generated trajectories with teacher feedback, but typically assume teacher-aligned prompt coverage, requiring prompts to match the teachers' training distributions. This assumption is difficult to satisfy when the general teacher is an open-source model whose post-training data are unknown. Instead of attempting to reconstruct this hidden distribution, we study general capability recovery with readily available proxy general prompts. We identify two failure modes of vanilla MOPD in this incomplete-coverage setting: recovery-preservation counteraction, which arises from mixing conflicting recovery and preservation gradients, and weak-signal flattening, which arises from uniformly averaging samples with unequal correction demand. We propose Counteraction-Aware Multi-Teacher On-Policy Distillation (CaMOPD), which addresses these issues with decoupled alternating training and gap-based sample selection. CaMOPD gives general recovery dedicated updates, periodically reviews domain prompts for preservation, and selects samples with larger averaged token-level teacher-student log-probability gaps to concentrate correction signals. Across role-play dialogue and medical reasoning QA scenarios, CaMOPD achieves the strongest general-capability recovery among the compared baselines while maintaining domain-specific behavior. Gradient coherence analyses further support the intended effect of CaMOPD in producing more coherent correction signals.
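
The two mechanisms named in the abstract, gap-based sample selection and decoupled alternating training, can be illustrated with a small sketch. This is not the authors' implementation: the function names, the `keep_ratio` and `review_every` parameters, the gap being taken as teacher minus student log-probability on the student's sampled tokens, and the fixed-period schedule are all assumptions made for illustration only.

```python
from typing import List, Sequence


def mean_token_gap(teacher_logps: Sequence[float],
                   student_logps: Sequence[float]) -> float:
    """Average per-token (teacher - student) log-probability gap for one trajectory.

    Assumption: both sequences score the same student-sampled tokens.
    """
    if len(teacher_logps) != len(student_logps) or not teacher_logps:
        raise ValueError("expected two non-empty sequences of equal length")
    total = sum(t - s for t, s in zip(teacher_logps, student_logps))
    return total / len(teacher_logps)


def select_by_gap(gaps: Sequence[float], keep_ratio: float = 0.5) -> List[int]:
    """Return indices of the samples with the largest gaps, largest first.

    `keep_ratio` is a hypothetical hyperparameter, not taken from the paper.
    """
    k = max(1, int(len(gaps) * keep_ratio))
    order = sorted(range(len(gaps)), key=lambda i: gaps[i], reverse=True)
    return order[:k]


def phase_for_step(step: int, review_every: int = 4) -> str:
    """Decoupled schedule: general-recovery updates with a periodic domain review.

    Assumption: every `review_every`-th step is a domain-preservation step.
    """
    if (step + 1) % review_every == 0:
        return "domain_review"
    return "general_recovery"
```

Under these assumptions, a training loop would call `phase_for_step` to decide which teacher and prompt pool to use at each step, compute `mean_token_gap` for every sampled trajectory, and pass only the indices returned by `select_by_gap` to the distillation loss, so that recovery and preservation gradients are never mixed in one update.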

Tianlei Chen, Jiao Ou, Ziyuan Liu, Ruiming Tang, Jian Liang, Han Li • 2026

Related benchmarks

Task	Dataset	Result
Logical Reasoning	ZebraLogic	Accuracy 72.6	86
Mathematical Reasoning	HMMT 25	Accuracy (HMMT 25) 34.58	50
Knowledge Reasoning	GPQA Diamond	Accuracy 61.99	48
Instruction Following	IF-Eval	Accuracy 59.89	14
Coding	LCB v5	Pass@1 56.63	6
General Evaluation	LiveBench 1125	Score 49	6
Medical Reasoning QA	MedXpertQA text	Accuracy 23.63	6
Preference-based Generation	Arena HP	Score 16.4	6
Coding	LCB v6	Pass@1 29.71	6
Medical Reasoning QA	MedQA USMLE	Accuracy 85.78	6
