Safe RLHF: Safe Reinforcement Learning from Human Feedback

About

With the development of large language models (LLMs), striking a balance between the performance and safety of AI systems has never been more critical. However, the inherent tension between the objectives of helpfulness and harmlessness presents a significant challenge during LLM training. To address this issue, we propose Safe Reinforcement Learning from Human Feedback (Safe RLHF), a novel algorithm for human value alignment. Safe RLHF explicitly decouples human preferences regarding helpfulness and harmlessness, which avoids crowdworkers' confusion about the tension between the two and allows us to train separate reward and cost models. We formalize the safety concern of LLMs as an optimization task: maximize the reward function while satisfying specified cost constraints. Leveraging the Lagrangian method to solve this constrained problem, Safe RLHF dynamically adjusts the balance between the two objectives during fine-tuning. Through three rounds of fine-tuning with Safe RLHF, we demonstrate a superior ability to mitigate harmful responses while enhancing model performance compared to existing value-aligned algorithms. Experimentally, we fine-tuned Alpaca-7B using Safe RLHF and aligned it with collected human preferences, significantly improving its helpfulness and harmlessness according to human evaluations.
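The constrained formulation in the abstract (maximize reward subject to a cost constraint, solved with a Lagrangian method) can be illustrated on a toy problem. The sketch below is not the authors' implementation: it replaces the LLM policy with a single scalar parameter `theta`, and the functions `reward`, `cost`, the threshold `cost_limit`, and the step sizes are all hypothetical choices made for illustration. It only shows how a Lagrange multiplier `lam` grows while the constraint is violated and thereby shifts the balance between the two objectives.

```python
def reward(theta: float) -> float:
    """Hypothetical 'helpfulness' objective, maximized at theta = 2."""
    return -(theta - 2.0) ** 2


def reward_grad(theta: float) -> float:
    """Derivative of reward with respect to theta."""
    return -2.0 * (theta - 2.0)


def cost(theta: float) -> float:
    """Hypothetical 'harmfulness' cost; here it simply grows with theta."""
    return theta


def cost_grad(theta: float) -> float:
    """Derivative of cost with respect to theta."""
    return 1.0


def lagrangian_ascent(cost_limit: float = 1.0,
                      theta_lr: float = 0.05,
                      lam_lr: float = 0.05,
                      steps: int = 2000) -> tuple[float, float]:
    """Solve max_theta reward(theta) s.t. cost(theta) <= cost_limit.

    Uses the Lagrangian L(theta, lam) = reward(theta) - lam * (cost(theta) - cost_limit)
    with gradient ascent on theta and projected gradient ascent on lam >= 0.
    """
    theta, lam = 0.0, 0.0
    for _ in range(steps):
        # Primal step: improve reward, penalized by lam times the cost.
        theta += theta_lr * (reward_grad(theta) - lam * cost_grad(theta))
        # Dual step: raise lam while the constraint is violated, lower it
        # otherwise, and project back onto lam >= 0.
        lam = max(0.0, lam + lam_lr * (cost(theta) - cost_limit))
    return theta, lam


if __name__ == "__main__":
    theta, lam = lagrangian_ascent()
    print(f"theta={theta:.4f}, lambda={lam:.4f}, "
          f"reward={reward(theta):.4f}, cost={cost(theta):.4f}")
```

In this toy setting the unconstrained optimum (`theta = 2`) violates the cost limit, so the multiplier rises until the iterate settles on the constraint boundary near `theta = 1` with `lam` near 2. In Safe RLHF the analogous quantities would be expected scores from the learned reward and cost models, with the policy updated by an RL algorithm rather than a closed-form gradient.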

Josef Dai, Xuehai Pan, Ruiyang Sun, Jiaming Ji, Xinbo Xu, Mickel Liu, Yizhou Wang, Yaodong Yang • 2023

Related benchmarks

Task	Dataset	Metric	Result
Instruction Following	AlpacaEval	Win Rate	57.02	423
General Knowledge	MMLU	MMLU General Knowledge Accuracy	70.77	373
Math Reasoning	MATH	Accuracy	44.63	160
Code Generation	LiveCodeBench	Pass@1	0.2184	86
Math Reasoning	OlympiadBench	Accuracy	22.22	76
Harmful Request Defense	AdvBench	ASR	6.64	65
Prohibited Content Detection	ALERT	ASR	0.1524	34
Red-team Query	HarmQA	ASR (%)	2.76	20
Harmful Query	Sorry	ASR (%)	37.95	20
Math and Reasoning	GSM8K	Accuracy	75.28	20

Showing 10 of 56 rows
