One-Shot Safety Alignment for Large Language Models via Optimal Dualization

About

The growing safety concerns surrounding large language models raise an urgent need to align them with diverse human preferences to simultaneously enhance their helpfulness and safety. A promising approach is to enforce safety constraints through Reinforcement Learning from Human Feedback (RLHF). For such constrained RLHF, typical Lagrangian-based primal-dual policy optimization methods are computationally expensive and often unstable. This paper presents a perspective of dualization that reduces constrained alignment to an equivalent unconstrained alignment problem. We do so by pre-optimizing a smooth and convex dual function that has a closed form. This shortcut eliminates the need for cumbersome primal-dual policy iterations, greatly reducing the computational burden and improving training stability. Our strategy leads to two practical algorithms in model-based and preference-based settings (MoCAN and PeCAN, respectively). A broad range of experiments demonstrate the effectiveness and merits of our algorithms.

Xinmeng Huang, Shuo Li, Edgar Dobriban, Osbert Bastani, Hamed Hassani, Dongsheng Ding• 2024

Related benchmarks

Task	Dataset	Result
Instruction Following	AlpacaEval	Win Rate95.71	420
General Knowledge	MMLU	MMLU General Knowledge Accuracy71.83	307
Math Reasoning	MATH	Accuracy75.5	160
Code Generation	LiveCodeBench	Pass@10.2416	86
Math Reasoning	OlympiadBench	Accuracy39.65	76
Safety Evaluation	Safety Suite AdvBench, PKU-SafeRLHF, HarmBench, JailbreakBench, SORRY-Bench, HarmfulQA, ALERT	AdvBench Score0.19	56
General LLM Capability	General Capability Suite MMLU, AlpacaEval, GSM8K, MATH, HumanEval	MMLU70.18	56
Harmful Request Defense	AdvBench	ASR0.38	44
Prohibited Content Detection	ALERT	ASR0.0865	34
Harmful Query	Sorry	ASR (%)12.79	20

Showing 10 of 16 rows

Other info

Follow for update

@wizwand_team Discord