Bi-Factorial Preference Optimization: Balancing Safety-Helpfulness in Language Models

About

Fine-tuning large language models (LLMs) on human preferences, typically through reinforcement learning from human feedback (RLHF), has proven successful in enhancing their capabilities. However, ensuring the safety of LLMs during fine-tuning remains a critical concern, and mitigating the potential conflicts in safety and helpfulness is costly in RLHF. To address this issue, we propose a supervised learning framework called Bi-Factorial Preference Optimization (BFPO), which re-parameterizes a joint RLHF objective of both safety and helpfulness into a single supervised learning objective. In supervised optimization, a labeling function is used to capture the global preferences ranking to balance both safety and helpfulness. To evaluate BFPO, we develop a benchmark that includes comprehensive discriminative and generative tasks for helpfulness and harmlessness. The results indicate that our method significantly outperforms existing approaches in both safety and helpfulness. Moreover, BFPO achieves the same level of safety as methods that heavily rely on human labor with less than 10\% of the computational resources and human prompting and annotation process. The training recipes can be found here: https://github.com/wx-zhang/bfpo.

Wenxuan Zhang, Philip H.S. Torr, Mohamed Elhoseiny, Adel Bibi• 2024

Related benchmarks

Task	Dataset	Result
Instruction Following	AlpacaEval	Win Rate97.2	420
General Knowledge	MMLU	MMLU General Knowledge Accuracy71.72	307
Math Reasoning	MATH	Accuracy74.55	160
Code Generation	LiveCodeBench	Pass@10.2436	86
Math Reasoning	OlympiadBench	Accuracy36.35	76
General LLM Capability	General Capability Suite MMLU, AlpacaEval, GSM8K, MATH, HumanEval	MMLU71.86	56
Safety Evaluation	Safety Suite AdvBench, PKU-SafeRLHF, HarmBench, JailbreakBench, SORRY-Bench, HarmfulQA, ALERT	AdvBench Score0.19	56
Harmful Request Defense	AdvBench	ASR0.64	44
Prohibited Content Detection	ALERT	ASR0.0722	34
Math and Reasoning	GSM8K	Accuracy83.85	20

Showing 10 of 15 rows

Other info

Follow for update

@wizwand_team Discord