
daDPO: Distribution-Aware DPO for Distilling Conversational Abilities

About

Large language models (LLMs) have demonstrated exceptional performance across various applications, but their conversational abilities decline sharply as model size decreases, a barrier to deployment in resource-constrained environments. Knowledge distillation with Direct Preference Optimization (dDPO) has emerged as a promising approach to enhancing the conversational abilities of smaller models using a larger teacher model. However, current methods primarily focus on "black-box" KD, which uses only the teacher's responses and overlooks the output distribution the teacher can provide. This paper addresses that gap by introducing daDPO (Distribution-Aware DPO), a unified method for preference optimization and distribution-based distillation. We provide rigorous theoretical analysis and empirical validation, showing that daDPO outperforms existing methods both in restoring the performance of pruned models and in enhancing smaller LLMs. Notably, in in-domain evaluation, our method enables a 20%-pruned Vicuna1.5-7B to achieve near-teacher performance (a -7.3% preference rate, versus dDPO's -31%), and allows Qwen2.5-1.5B to occasionally outperform its 7B teacher model (14.0% win rate).
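
The abstract describes daDPO only at a high level; the exact unified objective is defined in the paper. As an illustration of the general idea it names (a DPO preference loss combined with a distribution-level distillation signal from the teacher), here is a minimal PyTorch sketch. The function name `dadpo_style_loss`, the weight `alpha`, and the simple additive combination of the two terms are assumptions made for illustration, not the paper's actual formulation.

```python
import torch.nn.functional as F

def dadpo_style_loss(
    policy_chosen_logps,    # log p_student(y_w | x), shape (batch,)
    policy_rejected_logps,  # log p_student(y_l | x), shape (batch,)
    ref_chosen_logps,       # log p_ref(y_w | x), shape (batch,)
    ref_rejected_logps,     # log p_ref(y_l | x), shape (batch,)
    student_logits,         # (batch, seq_len, vocab) student logits on y_w
    teacher_logits,         # (batch, seq_len, vocab) teacher logits on y_w
    beta=0.1,               # standard DPO temperature
    alpha=0.5,              # hypothetical weight on the distillation term
):
    # Standard DPO term: raise the implicit reward margin of the chosen
    # response over the rejected one, relative to the reference model.
    margin = beta * (
        (policy_chosen_logps - ref_chosen_logps)
        - (policy_rejected_logps - ref_rejected_logps)
    )
    dpo_loss = -F.logsigmoid(margin).mean()

    # Distribution-level ("white-box") signal: pull the student's per-token
    # distribution toward the teacher's, using the teacher's full output
    # distribution rather than only its sampled responses.
    kd_loss = F.kl_div(
        F.log_softmax(student_logits, dim=-1),
        F.softmax(teacher_logits, dim=-1),
        reduction="batchmean",
    )
    return dpo_loss + alpha * kd_loss
```

The contrast with black-box dDPO lies entirely in the second term: dDPO sees only the teacher's generated text, while a distribution-aware objective also consumes the teacher's token-level distribution, which is what lets a pruned or smaller student recover more of the teacher's conversational behavior.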

Zhengze Zhang, Shiqi Wang, Yiqun Shen, Simin Guo, Dahua Lin, Xiaoliang Wang, Nguyen Cam-Tu, Fei Tan • 2025

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Multi-turn Dialogue Evaluation | MT-Bench | Overall Score | 6.1 | 331 |
| Instruction Following | AlpacaEval | Win Rate (%) | 81.49 | 125 |
| Instruction Following | Arena Hard | Win Rate (%) | 22.4 | 77 |
| Instruction Following and Helpfulness Evaluation | AlpacaEval 2.0 | Win Rate (%) | 16.41 | 58 |
| Instruction Following | In-domain | Win Rate (%) | 14.0 | 11 |
| Preference Alignment Evaluation | In-domain | Preference Rate vs. Teacher (%) | -7.3 | 11 |

Other info

Code
