Is DPO Superior to PPO for LLM Alignment? A Comprehensive Study

About

Reinforcement Learning from Human Feedback (RLHF) is currently the most widely used method to align large language models (LLMs) with human preferences. Existing RLHF methods can be roughly categorized as either reward-based or reward-free. Novel applications such as ChatGPT and Claude leverage reward-based methods that first learn a reward model and then apply actor-critic algorithms, such as Proximal Policy Optimization (PPO). However, in academic benchmarks, state-of-the-art results are often achieved via reward-free methods, such as Direct Preference Optimization (DPO). Is DPO truly superior to PPO? Why does PPO perform poorly on these benchmarks? In this paper, we first conduct both theoretical and empirical studies on the algorithmic properties of DPO and show that DPO may have fundamental limitations. Moreover, we also comprehensively examine PPO and reveal the key factors for the best performance of PPO in fine-tuning LLMs. Finally, we benchmark DPO and PPO across a collection of RLHF testbeds, ranging from dialogue to code generation. Experimental results demonstrate that PPO is able to surpass other alignment methods in all cases and achieve state-of-the-art results in challenging code competitions. Our code is publicly available at https://github.com/openpsi-project/ReaLHF.

Shusheng Xu, Wei Fu, Jiaxuan Gao, Wenjie Ye, Weilin Liu, Zhiyu Mei, Guangju Wang, Chao Yu, Yi Wu • 2024
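
The reward-based versus reward-free distinction drawn in the abstract can be illustrated with the two standard per-pair preference losses. The sketch below is not taken from the paper's code release: the function names, the scalar log-probability inputs, and the default `beta=0.1` are assumptions chosen for illustration. It shows the usual Bradley-Terry loss for fitting an explicit reward model (the first stage of a reward-based pipeline such as PPO) next to the standard DPO loss, which replaces the explicit reward with a scaled log-ratio between the policy and a frozen reference model.

```python
import math


def neg_log_sigmoid(x):
    """Numerically stable -log(sigmoid(x)), i.e. softplus(-x)."""
    if x >= 0:
        return math.log1p(math.exp(-x))
    return -x + math.log1p(math.exp(x))


def reward_model_loss(reward_chosen, reward_rejected):
    """Bradley-Terry loss for an explicit reward model (reward-based route).

    Inputs are the scalar rewards the model assigns to the preferred and
    the dispreferred response for the same prompt.
    """
    return neg_log_sigmoid(reward_chosen - reward_rejected)


def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one preference pair (reward-free route).

    Inputs are summed token log-probabilities of each response under the
    trained policy and under the frozen reference model. beta=0.1 is an
    illustrative default, not a value taken from the paper.
    """
    chosen_logratio = policy_logp_chosen - ref_logp_chosen
    rejected_logratio = policy_logp_rejected - ref_logp_rejected
    # beta * log-ratio plays the role of an implicit reward.
    margin = beta * (chosen_logratio - rejected_logratio)
    return neg_log_sigmoid(margin)
```

When the policy equals the reference model, both log-ratios are zero and `dpo_loss` returns `log(2)`; the loss falls as the policy shifts probability mass toward the preferred response relative to the reference.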

Related benchmarks

Task	Dataset	Metric	Result	
Mathematical Reasoning	MathQA	Accuracy	71.9	354
Mathematical Reasoning	MATH level 5	Success Rate	40.2	20
General Reasoning	BBH	Success Rate (BBH General Reasoning)	66.8	14
Logic Consistency Evaluation	Aggregate of 16 tasks	LCS (Avg)	51.2	14
Mathematical Reasoning	MATH Overall	SR	48.5	14
Mathematical Reasoning	MATH Level-4	SR (%)	54.3	14
Decision Inference	MMLU	Accuracy	0.723	11
Decision Inference	SMAC	Accuracy	65.3	11

