DPO-Shift: Shifting the Distribution of Direct Preference Optimization
About
Direct Preference Optimization (DPO) and its variants have become increasingly popular for aligning language models with human preferences. These methods aim to teach models to better distinguish between chosen (preferred) and rejected (dispreferred) responses. However, prior research has observed that the probability of the chosen responses often decreases during training, a phenomenon known as likelihood displacement. To address this issue, we introduce DPO-Shift, which controllably shifts the distribution of the chosen probability. We then show that DPO-Shift exhibits a fundamental trade-off between improving the chosen probability and sacrificing the reward margin, as supported by both theoretical analysis and experimental validation. Furthermore, we demonstrate the superiority of DPO-Shift over DPO on downstream tasks such as MT-Bench and a designed win rate experiment. We believe this study shows that the likelihood displacement issue of DPO can be effectively mitigated with a simple, theoretically grounded solution. Our code is available at https://github.com/Meaquadddd/DPO-Shift.
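To make the idea concrete, below is a minimal, illustrative sketch of a per-example loss in the spirit described above: the standard DPO logistic loss, with the rejected log-ratio scaled by a factor `f_lambda` so that the distribution of the chosen probability can be shifted. The function and parameter names are our own for illustration, not the repository's API; setting `f_lambda = 1.0` recovers plain DPO.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dpo_shift_loss(logp_chosen, logp_rejected,
                   ref_logp_chosen, ref_logp_rejected,
                   beta=0.1, f_lambda=0.95):
    """Illustrative DPO-style loss with a shift factor on the rejected term.

    logp_*      : sequence log-probabilities under the policy model
    ref_logp_*  : sequence log-probabilities under the frozen reference model
    beta        : standard DPO temperature
    f_lambda    : scaling on the rejected log-ratio (f_lambda = 1.0 -> plain DPO);
                  values below 1 down-weight the rejected side, shifting mass
                  toward the chosen response, at the cost of reward margin.
    """
    chosen_ratio = logp_chosen - ref_logp_chosen
    rejected_ratio = logp_rejected - ref_logp_rejected
    margin = beta * chosen_ratio - f_lambda * beta * rejected_ratio
    return -math.log(sigmoid(margin))
```

This is a sketch under stated assumptions; the exact form of the shift (e.g. how `f_lambda` is chosen or scheduled) follows the paper and the linked repository.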
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Object Hallucination Evaluation | POPE | Accuracy | 89 | 1455 |
| Multimodal Reasoning | MMStar | -- | -- | 143 |
| Visual Question Answering | SimpleVQA | Accuracy | 0.397 | 99 |
| Hallucination Assessment | HallusionBench | -- | -- | 39 |
| Multimodal Reasoning | MMBench CN | Accuracy | 81 | 36 |
| Instruction Following | MM-IFEval | Score | 50 | 28 |
| Multimodal Reasoning | MMBench EN | Accuracy | 83 | 24 |
| Generative Hallucination Evaluation | AMBER | Score | 89.97 | 14 |
| OCR Parsing | OCRBench EN v2 | Score | 46.7 | 14 |
| OCR Parsing | OCRBench ZH v2 | Overall Score | 45 | 14 |