SafeDPO: A Simple Approach to Direct Preference Optimization with Enhanced Safety

About

As Large Language Models (LLMs) are increasingly deployed in real-world applications, balancing helpfulness and safety has become a central challenge. A natural approach is to incorporate safety constraints into Reinforcement Learning from Human Feedback (RLHF), where recent studies have shown promising progress. However, these methods often rely on auxiliary networks or multi-stage pipelines, thereby increasing complexity. In this work, we revisit the original safety alignment objective and show that, under mild assumptions, it admits a closed-form optimal policy. We further derive a provably equivalent and tractable objective, enabling direct optimization. Building on this insight, we propose SafeDPO, a lightweight method that preserves the optimal solution of the underlying safety-constrained objective while requiring only one additional hyperparameter and minimal modifications to existing preference-based training methods. SafeDPO eliminates the need for reward models, cost models, and online sampling, relying only on preference data and safety indicators. Despite its simplicity, SafeDPO achieves competitive safety-helpfulness trade-offs compared to existing safety alignment methods. Experiments on the PKU-SafeRLHF-30K benchmark demonstrate that SafeDPO substantially improves safety while maintaining competitive helpfulness. Ablation studies further show that the additional hyperparameter provides a flexible mechanism to enhance safety while preserving the theoretical optimum, and confirm that SafeDPO scales reliably to LLMs with up to 13B parameters. Overall, our results highlight that a simple, theory-driven objective can provide a lightweight yet effective solution for safety alignment in practice.

Geon-Hyeong Kim, Yu Jin Kim, Byoungjip Kim, Honglak Lee, Kyunghoon Bae, Youngsoo Jang, Moontae Lee• 2025

Related benchmarks

Task	Dataset	Result
Safety Alignment	XSTest	Compliance68.5	21
Helpful and Harmless Response Generation	PKU-SafeRLHF alpaca2-7b (test)	Helpfulness8.2	14
LLM Safety Alignment	PKU-SafeRLHF full deduplicated (test)	Helpfulness8.09	12
Safety Evaluation	XSTest	Refusal Rate12.4	7
Safe RLHF Alignment	PKU-SafeRLHF 30K	--	7
Harmlessness	PKU-SafeRLHF 30K	Win Rate87.25	6
Helpfulness	PKU-SafeRLHF 30K	Win Rate84.5	6
Harmlessness	Template T3 GPT-4 evaluation (test)	Win Rate87.5	5
Harmlessness	GPT-4 Evaluation Template T2 (overall)	Win Rate89.99	5
Harmlessness evaluation	Harmlessness (evaluation set)	Win Rate48.76	5

Showing 10 of 15 rows

Other info

Follow for update

@wizwand_team Discord