DPO-Shift: Shifting the Distribution of Direct Preference Optimization
About
Direct Preference Optimization (DPO) and its variants have become increasingly popular for aligning language models with human preferences. These methods aim to teach models to better distinguish between chosen (preferred) and rejected (dispreferred) responses. However, prior research has observed that the probability of the chosen responses often decreases during training, a phenomenon known as likelihood displacement. To address this issue, we introduce DPO-Shift, which controllably shifts the distribution of the chosen probability. We then show that DPO-Shift exhibits a fundamental trade-off between improving the chosen probability and sacrificing the reward margin, as supported by both theoretical analysis and experimental validation. Furthermore, we demonstrate the superiority of DPO-Shift over DPO on downstream tasks such as MT-Bench and a designed win rate experiment. We believe this study shows that the likelihood displacement issue of DPO can be effectively mitigated with a simple, theoretically grounded solution. Our code is available at https://github.com/Meaquadddd/DPO-Shift.
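To make the idea concrete, below is a minimal, illustrative sketch of a per-example loss in the spirit described above: the standard DPO logistic loss, with the rejected log-ratio scaled by a factor `f_lambda` so that the distribution of the chosen probability can be shifted. The function and parameter names are our own for illustration, not the repository's API; setting `f_lambda = 1.0` recovers plain DPO.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dpo_shift_loss(logp_chosen, logp_rejected,
                   ref_logp_chosen, ref_logp_rejected,
                   beta=0.1, f_lambda=0.95):
    """Illustrative DPO-style loss with a shift factor on the rejected term.

    logp_*      : sequence log-probabilities under the policy model
    ref_logp_*  : sequence log-probabilities under the frozen reference model
    beta        : standard DPO temperature
    f_lambda    : scaling on the rejected log-ratio (f_lambda = 1.0 -> plain DPO);
                  values below 1 down-weight the rejected side, shifting mass
                  toward the chosen response, at the cost of reward margin.
    """
    chosen_ratio = logp_chosen - ref_logp_chosen
    rejected_ratio = logp_rejected - ref_logp_rejected
    margin = beta * chosen_ratio - f_lambda * beta * rejected_ratio
    return -math.log(sigmoid(margin))
```

This is a sketch under stated assumptions; the exact form of the shift (e.g. how `f_lambda` is chosen or scheduled) follows the paper and the linked repository.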
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Object Hallucination Evaluation | POPE | Accuracy | 89 | 1455 |
| Multimodal Reasoning | MMStar | -- | -- | 143 |
| Visual Question Answering | SimpleVQA | Accuracy | 0.397 | 99 |
| Hallucination Assessment | HallusionBench | -- | -- | 39 |
| Multimodal Reasoning | MMBench CN | Accuracy | 81 | 36 |
| Instruction Following | MM-IFEval | Score | 50 | 28 |
| Multimodal Reasoning | MMBench EN | Accuracy | 83 | 24 |
| Generative Hallucination Evaluation | AMBER | Score | 89.97 | 14 |
| OCR Parsing | OCRBench EN v2 | Score | 46.7 | 14 |
| OCR Parsing | OCRBench ZH v2 | Overall Score | 45 | 14 |