
RLHF Workflow: From Reward Modeling to Online RLHF

About

In this technical report, we present the workflow of Online Iterative Reinforcement Learning from Human Feedback (RLHF), which is widely reported in the recent large language model (LLM) literature to outperform its offline counterpart by a large margin. However, existing open-source RLHF projects remain largely confined to the offline learning setting. We aim to fill this gap and provide a detailed, easily reproducible recipe for online iterative RLHF. In particular, since online human feedback is usually infeasible for open-source communities with limited resources, we begin by constructing preference models from a diverse set of open-source datasets, and we use the resulting proxy preference model to approximate human feedback. We then discuss the theoretical insights and algorithmic principles behind online iterative RLHF, followed by a detailed practical implementation. Our trained LLM achieves impressive performance on LLM chatbot benchmarks, including AlpacaEval-2, Arena-Hard, and MT-Bench, as well as on academic benchmarks such as HumanEval and TruthfulQA. We show that supervised fine-tuning (SFT) followed by iterative RLHF can obtain state-of-the-art performance with fully open-source datasets. We have also made our models, curated datasets, and comprehensive step-by-step code guidebooks publicly available. Please refer to https://github.com/RLHFlow/RLHF-Reward-Modeling and https://github.com/RLHFlow/Online-RLHF for more detailed information.
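The core loop described above — sample multiple responses from the current policy, rank them with the proxy preference model in place of human raters, and keep best-vs-worst pairs for the next preference-optimization update — can be sketched as follows. This is a toy illustration, not the paper's implementation: `toy_policy`, `proxy_reward`, and `collect_preference_pairs` are hypothetical stand-ins for the actual LLM, the trained reward model, and the data-collection step of each online iteration.

```python
import random

def proxy_reward(response: str) -> float:
    """Stand-in for the trained proxy preference model
    (here: a trivial scorer favoring longer, polite answers)."""
    return len(response) + (10.0 if "please" in response else 0.0)

def toy_policy(prompt: str, rng: random.Random) -> str:
    """Stand-in for the current LLM policy: samples one of a few
    canned responses to the prompt."""
    candidates = [
        f"{prompt} Sure, please see the explanation below.",
        f"{prompt} OK.",
        f"{prompt} No.",
    ]
    return rng.choice(candidates)

def collect_preference_pairs(prompts, n_samples=4, seed=0):
    """One online exploration round: sample n_samples responses per
    prompt, rank them with the proxy reward, and keep
    (chosen, rejected) = (best, worst) pairs for the next
    preference-optimization (e.g. DPO) update."""
    rng = random.Random(seed)
    pairs = []
    for p in prompts:
        responses = [toy_policy(p, rng) for _ in range(n_samples)]
        ranked = sorted(responses, key=proxy_reward, reverse=True)
        pairs.append((p, ranked[0], ranked[-1]))
    return pairs

pairs = collect_preference_pairs(["Explain RLHF.", "What is a reward model?"])
for prompt, chosen, rejected in pairs:
    # The chosen response always scores at least as high as the rejected one.
    assert proxy_reward(chosen) >= proxy_reward(rejected)
```

In the full recipe, the collected pairs would be fed into a preference-optimization step to update the policy, and the loop would repeat for several iterations, so that later rounds explore with an improved policy.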

Hanze Dong, Wei Xiong, Bo Pang, Haoxiang Wang, Han Zhao, Yingbo Zhou, Nan Jiang, Doyen Sahoo, Caiming Xiong, Tong Zhang • 2024

Related benchmarks

Task | Dataset | Metric | Result | Rank
Commonsense Reasoning | HellaSwag | Accuracy | 80.83 | 1891
Code Generation | HumanEval | -- | -- | 1036
Multi-task Language Understanding | MMLU | Accuracy | 65.19 | 876
Instruction Following | IFEval | IFEval Accuracy | 65.47 | 625
Mathematical Reasoning | MATH | Accuracy | 80 | 535
Instruction Following | AlpacaEval 2.0 | Win Rate | 8.17 | 507
Multi-task Language Understanding | MMLU | Accuracy | 65.13 | 413
Mathematical Reasoning | GSM8K | Accuracy | 77.18 | 358
Instruction Following | MT-Bench | MT-Bench Score | 5.93 | 215
Commonsense Reasoning | HellaSwag | Accuracy | 73.15 | 213

(Showing 10 of 25 rows)
