Back to Basics: Revisiting REINFORCE Style Optimization for Learning from Human Feedback in LLMs

About

AI alignment in the shape of Reinforcement Learning from Human Feedback (RLHF) is increasingly treated as a crucial ingredient for high-performance large language models. Proximal Policy Optimization (PPO) has been positioned by recent literature as the canonical method for the RL part of RLHF. However, it involves both high computational cost and sensitive hyperparameter tuning. We posit that most of the motivational principles that led to the development of PPO are less of a practical concern in RLHF and advocate for a less computationally expensive method that preserves and even increases performance. We revisit the formulation of alignment from human preferences in the context of RL. Keeping simplicity as a guiding principle, we show that many components of PPO are unnecessary in an RLHF context and that far simpler REINFORCE-style optimization variants outperform both PPO and newly proposed "RL-free" methods such as DPO and RAFT. Our work suggests that careful adaptation to the characteristics of LLM alignment enables benefiting from online RL optimization at low cost.
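To make the contrast concrete, the sketch below shows what a bare REINFORCE-style update can look like when each sampled completion is treated as a single action and the reward model's score as its reward, with a simple batch-mean baseline for variance reduction. This is a minimal illustrative sketch under those assumptions (the function and variable names here are invented), not the authors' implementation or the specific variants evaluated in the paper.

```python
import torch

def reinforce_loss(seq_logprobs: torch.Tensor, rewards: torch.Tensor) -> torch.Tensor:
    """REINFORCE on sequence-level rewards with a batch-mean baseline (illustrative sketch).

    seq_logprobs: log pi_theta(y | x), summed over generated tokens, shape (B,)
    rewards:      scalar reward-model scores for each completion, shape (B,)
    """
    # A baseline reduces gradient variance without biasing it; here the batch mean.
    baseline = rewards.mean()
    advantages = rewards - baseline
    # Gradient ascent on E[(R - b) * log pi(y | x)]; negate to use as a minimization loss.
    return -(advantages.detach() * seq_logprobs).mean()

# Toy usage with made-up numbers:
logprobs = torch.tensor([-12.3, -8.7, -15.1], requires_grad=True)
scores = torch.tensor([0.2, 0.9, -0.4])
loss = reinforce_loss(logprobs, scores)
loss.backward()
```

Compared with this, PPO additionally maintains a learned value network and applies per-token ratio clipping; dropping those components is the kind of simplification the abstract argues is affordable in the RLHF setting.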

Arash Ahmadian, Chris Cremer, Matthias Gallé, Marzieh Fadaee, Julia Kreutzer, Olivier Pietquin, Ahmet Üstün, Sara Hooker • 2024

Related benchmarks

Task                           Dataset           Result                      Rank
Mathematical Reasoning         GSM8K (test)      Accuracy: 58.31             900
Mathematical Reasoning         AIME 2024         Accuracy: 30                370
Mathematical Reasoning         CollegeMATH       Accuracy: 41.4              276
Mathematical Reasoning         AMC               Accuracy: 55                221
Mathematical Reasoning         Minerva Math      Accuracy: 36.4              186
Mathematical Reasoning         AIME 2024 (test)  Accuracy: 20                159
Multimodal Reasoning           MMMU (val)        Accuracy: 51.89             144
Multimodal Reasoning           WeMath            Accuracy: 57.99             129
Interactive Decision-making    AlfWorld          Overall Success Rate: 75.5  118
Single-hop Question Answering  PopQA             --                          104

(Showing 10 of 105 rows)
