
DPO Meets PPO: Reinforced Token Optimization for RLHF

About

In the classical Reinforcement Learning from Human Feedback (RLHF) framework, Proximal Policy Optimization (PPO) is employed to learn from sparse, sentence-level rewards, a challenging scenario in traditional deep reinforcement learning. Despite PPO's great success in aligning large language models, its open-source implementations remain largely sub-optimal. To address these issues, we introduce a framework that models RLHF problems as a Markov decision process (MDP), enabling the capture of fine-grained token-wise information. Under this framework, we introduce an algorithm, Reinforced Token Optimization (RTO), which learns a token-wise reward function from preference data and performs policy optimization based on this learned token-wise reward signal. Theoretically, RTO is proven to find a near-optimal policy in a sample-efficient manner. For its practical implementation, RTO innovatively integrates Direct Preference Optimization (DPO) and PPO. DPO, originally derived from sparse sentence-level rewards, surprisingly provides a token-wise characterization of response quality, which is seamlessly incorporated into our subsequent PPO training stage. Extensive experiments demonstrate that RTO outperforms PPO and other direct preference learning algorithms. In particular, RTO outperforms PPO by 7.5 points on the AlpacaEval 2 benchmark and by 4.1 points on Arena-Hard. Our code and models are available at https://github.com/zkshan2002/RTO.

Han Zhong, Zikang Shan, Guhao Feng, Wei Xiong, Xinle Cheng, Li Zhao, Di He, Jiang Bian, Liwei Wang • 2024
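
The practical recipe described in the abstract (reusing DPO's implicit reward as a dense, token-level signal for a PPO stage) can be illustrated with a short sketch. This is not the authors' released code; the function name, the value of beta, and the example log-probabilities are illustrative assumptions. See the linked repository for the actual implementation.

import torch

def token_wise_rewards(policy_logprobs: torch.Tensor,
                       ref_logprobs: torch.Tensor,
                       beta: float = 0.1) -> torch.Tensor:
    """DPO's implicit per-token reward: beta * (log pi_dpo - log pi_ref).

    `policy_logprobs` and `ref_logprobs` are assumed to hold the per-token
    log-probabilities of a sampled response under a DPO-trained model and
    the frozen reference model. Each token receives its own reward,
    replacing the single sentence-level reward of classical PPO-based RLHF.
    """
    return beta * (policy_logprobs - ref_logprobs)

# Hypothetical example: a 5-token response scored token by token.
policy_lp = torch.tensor([-1.2, -0.4, -2.1, -0.9, -0.3])
ref_lp    = torch.tensor([-1.5, -0.8, -1.9, -1.3, -0.6])
rewards = token_wise_rewards(policy_lp, ref_lp)
print(rewards)  # dense per-token signal that would feed the PPO stage

The design point is that no separate token-level reward model needs to be trained: DPO, although fit on sentence-level preference pairs, yields a log-ratio that decomposes over tokens.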

Related benchmarks

Task | Dataset | Metric | Result | Rank
Multi-turn Dialogue Evaluation | MT-Bench | Overall Score | 6.94 | 331
LLM Alignment Evaluation | AlpacaEval 2 | LC Win Rate | 46.84 | 72
LLM Alignment Evaluation | Arena Hard | Win Rate | 30.4 | 67
Instruction Following and Helpfulness Evaluation | AlpacaEval 2.0 | Win Rate | 16.91 | 58
Safety Alignment Evaluation | Llama-Guard | Harmfulness (%) | 85.71 | 36
Summarization | TL;DR | Win Rate | 90.8 | 6
