DPO-Shift: Shifting the Distribution of Direct Preference Optimization

About

Direct Preference Optimization (DPO) and its variants have become increasingly popular for aligning language models with human preferences. These methods aim to teach models to better distinguish between chosen (preferred) and rejected (dispreferred) responses. However, prior research has identified that the probability of chosen responses often decreases during training, a phenomenon known as likelihood displacement. To tackle this challenge, we introduce DPO-Shift, which controllably shifts the distribution of the chosen probability. We then show that DPO-Shift exhibits a fundamental trade-off between improving the chosen probability and sacrificing the reward margin, as supported by both theoretical analysis and experimental validation. Furthermore, we demonstrate the superiority of DPO-Shift over DPO on downstream tasks such as MT-Bench and a designed win rate experiment. We believe this study shows that the likelihood displacement issue of DPO can be effectively mitigated with a simple, theoretically grounded solution. Our code is available at https://github.com/Meaquadddd/DPO-Shift.

Xiliang Yang, Feng Jiang, Qianen Zhang, Lei Zhao, Xiao Li • 2025
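The abstract does not spell out the modified objective, so below is a minimal PyTorch sketch of how a shift parameter could enter the standard DPO loss. The function name `dpo_shift_loss` and the placement of the parameter `f_lambda` (scaling the rejected log-ratio inside the sigmoid, with `f_lambda = 1` recovering plain DPO) are illustrative assumptions, not the authors' confirmed formulation; see the linked repository for the actual implementation.

```python
# Minimal sketch of a DPO-style loss with a shift parameter, in PyTorch.
# The exact placement of the shift (here: a factor f_lambda down-weighting
# the rejected log-ratio) is an assumption for illustration; consult
# https://github.com/Meaquadddd/DPO-Shift for the authors' formulation.

import torch
import torch.nn.functional as F

def dpo_shift_loss(
    policy_chosen_logps: torch.Tensor,    # log pi_theta(y_w | x)
    policy_rejected_logps: torch.Tensor,  # log pi_theta(y_l | x)
    ref_chosen_logps: torch.Tensor,       # log pi_ref(y_w | x)
    ref_rejected_logps: torch.Tensor,     # log pi_ref(y_l | x)
    beta: float = 0.1,
    f_lambda: float = 0.75,  # hypothetical shift parameter in (0, 1]; 1.0 recovers DPO
) -> torch.Tensor:
    # Standard DPO implicit rewards: beta times the log-ratio against the reference model.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)

    # Down-weighting the rejected reward inside the sigmoid biases the optimum
    # toward keeping the chosen probability high, at the cost of a smaller
    # reward margin: the trade-off described in the abstract.
    logits = chosen_rewards - f_lambda * rejected_rewards
    return -F.logsigmoid(logits).mean()
```

Under this sketch, shrinking `f_lambda` relieves the pressure to push the rejected (and, through likelihood displacement, the chosen) probability down, which is one plausible reading of "controllably shifting the distribution of the chosen probability."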

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Object Hallucination Evaluation | POPE | Accuracy | 89 | 1455 |
| Multimodal Reasoning | MMStar | -- | -- | 143 |
| Visual Question Answering | SimpleVQA | Accuracy | 0.397 | 99 |
| Hallucination Assessment | HallusionBench | -- | -- | 39 |
| Multimodal Reasoning | MMBench CN | Accuracy | 81 | 36 |
| Instruction Following | MM-IFEval | Score | 50 | 28 |
| Multimodal Reasoning | MMBench EN | Accuracy | 83 | 24 |
| Generative Hallucination Evaluation | AMBER | Score | 89.97 | 14 |
| OCR Parsing | OCRBench EN v2 | Score | 46.7 | 14 |
| OCR Parsing | OCRBench ZH v2 | Overall Score | 45 | 14 |
