Towards Robust Alignment of Language Models: Distributionally Robustifying Direct Preference Optimization

About

This study addresses the challenge of noise in training datasets for Direct Preference Optimization (DPO), a method for aligning Large Language Models (LLMs) with human preferences. We categorize noise into pointwise noise, which includes low-quality data points, and pairwise noise, which encompasses erroneous data pair associations that affect preference rankings. Utilizing Distributionally Robust Optimization (DRO), we enhance DPO's resilience to these types of noise. Our theoretical insights reveal that DPO inherently embeds DRO principles, conferring robustness to pointwise noise, with the regularization coefficient $\beta$ playing a critical role in its noise resistance. Extending this framework, we introduce Distributionally Robustifying DPO (Dr. DPO), which integrates pairwise robustness by optimizing against worst-case pairwise scenarios. The novel hyperparameter $\beta'$ in Dr. DPO allows for fine-tuned control over data pair reliability, providing a strategic balance between exploration and exploitation in noisy training environments. Empirical evaluations demonstrate that Dr. DPO substantially improves the quality of generated text and response accuracy in preference datasets, showcasing enhanced performance in both noisy and noise-free settings. The code is available at https://github.com/junkangwu/Dr_DPO.

Junkang Wu, Yuexiang Xie, Zhengyi Yang, Jiancan Wu, Jiawei Chen, Jinyang Gao, Bolin Ding, Xiang Wang, Xiangnan He • 2024
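
To make the abstract's description concrete, below is a minimal PyTorch sketch of a DPO-style loss with a DRO-flavored pairwise aggregation controlled by $\beta'$. It replaces the usual batch average of per-pair DPO losses with a soft aggregation $-\beta' \log \mathbb{E}[\exp(\log\sigma(\beta\Delta)/\beta')]$, which recovers standard DPO as $\beta' \to \infty$ and concentrates on the most confidently ranked pairs as $\beta' \to 0$. The function name, tensor layout, and this exact objective form are assumptions for illustration, not the authors' verbatim code; the official implementation is in the linked repository.

```python
import math

import torch
import torch.nn.functional as F


def dr_dpo_loss(policy_chosen_logps, policy_rejected_logps,
                ref_chosen_logps, ref_rejected_logps,
                beta=0.1, beta_prime=1.0):
    """Sketch of a pairwise-robust DPO loss over a batch of preference pairs.

    Each *_logps tensor holds the summed log-probability of the chosen /
    rejected response under the policy or the frozen reference model.
    """
    # Implicit reward margin of each pair (the standard DPO quantity).
    pi_logratios = policy_chosen_logps - policy_rejected_logps
    ref_logratios = ref_chosen_logps - ref_rejected_logps
    margin = beta * (pi_logratios - ref_logratios)

    # Per-pair DPO log-likelihood: log sigma(beta * Delta).
    per_pair = F.logsigmoid(margin)

    # DRO-style aggregation over the batch:
    #   -beta' * log mean(exp(per_pair / beta'))
    # Small beta' concentrates weight on pairs the model already ranks
    # confidently (plausibly clean labels), damping likely-flipped pairs;
    # large beta' approaches the ordinary DPO average -mean(per_pair).
    n = per_pair.numel()
    loss = -beta_prime * (torch.logsumexp(per_pair / beta_prime, dim=0)
                          - math.log(n))
    return loss


# Sanity check on random inputs: with a large beta', the robust loss
# should be close to the plain DPO batch average.
if __name__ == "__main__":
    logps = [torch.randn(8) for _ in range(4)]
    print(dr_dpo_loss(*logps, beta=0.1, beta_prime=1e4))
    print(-F.logsigmoid(0.1 * ((logps[0] - logps[1]) - (logps[2] - logps[3]))).mean())
```

The log-mean-exp form used here is the standard dual of a KL-constrained worst case over the pair distribution, which is the usual way DRO turns a robustness constraint into a per-example reweighting; whether Dr. DPO's released objective matches this sketch term for term should be checked against the repository.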

Related benchmarks

Task             | Dataset                                                  | Metric          | Result | Rank
Safety Alignment | PKU-SafeRLHF 30K (IID)                                   | WR              | 88.93  | 36
Safety Alignment | Do-Not-Answer                                            | MD              | 0.21   | 36
Safety Alignment | Salad Bench                                              | MD              | 1.53   | 36
Safety Alignment | HH-RLHF                                                  | MD Rate         | 2.39   | 36
Safety Alignment | Average (Do-Not-Answer, HarmBench, HH-RLHF, Salad Bench) | Aggregate Score | 1.2    | 18
Safety Alignment | HarmBench                                                | MD Score        | 2      | 18
