Counteraction-Aware Multi-Teacher On-Policy Distillation for General Capability Recovery with Domain Preservation

About

Domain specialization can improve LLM behavior in vertical domains, but often weakens the general capabilities inherited from the original model. Recent Multi-Teacher On-Policy Distillation (MOPD) pipelines recover model capabilities by supervising student-generated trajectories with teacher feedback, but typically assume teacher-aligned prompt coverage, requiring prompts to match the teachers' training distributions. This assumption is difficult to satisfy when the general teacher is an open-source model whose post-training data are unknown. Instead of attempting to reconstruct this hidden distribution, we study general capability recovery with readily available proxy general prompts. We identify two failure modes of vanilla MOPD in this incomplete-coverage situation: recovery-preservation counteraction from mixing conflicting recovery and preservation gradients, and weak-signal flattening from uniformly averaging samples with unequal correction demand. We propose Counteraction-Aware Multi-Teacher On-Policy Distillation (CaMOPD), which addresses these issues with decoupled alternating training and gap-based sample selection. CaMOPD gives general recovery dedicated updates, periodically reviews domain prompts for preservation, and selects samples with larger averaged token-level teacher-student log-probability gaps to concentrate correction signals. Across role-play dialogue and medical reasoning QA scenarios, CaMOPD performs best in general recovery over baselines while maintaining domain-specific behavior. Gradient coherence analyses further support the intended effect of CaMOPD in producing more coherent correction signals.
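The two mechanisms named above, gap-based sample selection and decoupled alternating training, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the signed (teacher minus student) form of the gap, the fixed `k` cutoff, and the `review_every` interval are all assumptions made for illustration, since the abstract does not specify them.

```python
import numpy as np

def mean_token_gap(teacher_logp, student_logp):
    """Average token-level (teacher - student) log-probability gap for one sample.

    Assumption: the gap is signed and averaged over the tokens of a
    student-generated trajectory; the abstract does not give the exact form.
    """
    t = np.asarray(teacher_logp, dtype=float)
    s = np.asarray(student_logp, dtype=float)
    return float(np.mean(t - s))

def select_by_gap(teacher_logps, student_logps, k):
    """Return the indices of the k samples with the largest mean gap, plus all gaps.

    `k` is a hypothetical cutoff; the actual selection rule may differ.
    """
    gaps = np.array([mean_token_gap(t, s)
                     for t, s in zip(teacher_logps, student_logps)])
    order = np.argsort(-gaps, kind="stable")  # largest gap first
    return sorted(order[:k].tolist()), gaps

def alternating_schedule(num_steps, review_every=4):
    """Label each step as a general-recovery update or a periodic domain review.

    `review_every` is a hypothetical interval, not a value from the paper.
    """
    return ["domain_review" if (step + 1) % review_every == 0
            else "general_recovery"
            for step in range(num_steps)]

# Toy per-token log-probabilities for three student-generated samples.
teacher = [[-0.1, -0.2], [-1.0, -1.0], [-0.5, -0.5, -0.5]]
student = [[-0.2, -0.3], [-3.0, -2.0], [-0.6, -0.4, -0.5]]

selected, gaps = select_by_gap(teacher, student, k=1)
schedule = alternating_schedule(8, review_every=4)
```

In this toy setup the second sample has by far the largest mean gap, so it is the one kept when `k=1`, and the schedule dedicates most steps to general recovery with a domain review on every fourth step. Keeping the two update types in separate steps is what distinguishes this from mixing recovery and preservation gradients in a single batch.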

Tianlei Chen, Jiao Ou, Ziyuan Liu, Ruiming Tang, Jian Liang, Han Li • 2026

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Logical reasoning | ZebraLogic | Accuracy 72.6 | 54 |
| Mathematical Reasoning | HMMT 25 | Accuracy (HMMT 25) 34.58 | 50 |
| Instruction Following | IF-Eval | Accuracy 59.89 | 14 |
| Knowledge Reasoning | GPQA Diamond | Accuracy 61.99 | 12 |
| Coding | LCB v5 | Pass@1 56.63 | 6 |
| General Evaluation | LiveBench 1125 | Score 49 | 6 |
| Medical Reasoning QA | MedXpertQA text | Accuracy 23.63 | 6 |
| Preference-based Generation | Arena HP | Score 16.4 | 6 |
| Coding | LCB v6 | Pass@1 29.71 | 6 |
| Medical Reasoning QA | MedQA USMLE | Accuracy 85.78 | 6 |
Showing 10 of 11 rows
