Curriculum Learning for Safety Alignment

About

Direct Preference Optimisation (DPO) is widely used for safety alignment in large language models. However, prior work shows it is brittle and exhibits poor out-of-distribution (OOD) generalisation. In this paper, we investigate whether Curriculum Learning can improve the robustness of DPO-based safety alignment. We propose Staged-Competence, a curriculum-based framework that organises preference data by difficulty, employs competence-based sampling, and progressively updates the reference model during training. Averaged across three model families, Staged-Competence reduces OOD harmful response rates by 16% and jailbreak attack success rates by 20%, while preserving general capabilities with near-zero over-refusal. We further show that Staged-Competence (1) matches baseline safety with only 75% of the training data and (2) yields better separation between safe and unsafe responses. Staged-Competence is agnostic to the policy optimisation loss and can extend to other DPO variants and alignment domains. Our code and data are available at https://github.com/Sandeep5500/curriculum-learning-for-safety.

Sandeep Kumar, Virginia Smith, Chhavi Yadav • 2026
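The three ingredients named in the abstract (difficulty-ordered preference data, competence-based sampling, and staged reference-model updates) can be illustrated with a minimal sketch. This is not the authors' implementation. The square-root competence schedule is a common choice in competence-based curriculum learning and is assumed here, as are the function names (`competence`, `sample_batch`, `should_refresh_reference`), the toy difficulty scores, the initial competence `c0`, and the rule of refreshing the reference model at equal-length stage boundaries.

```python
import math
import random

def competence(step, total_steps, c0=0.1):
    """Assumed square-root schedule: fraction of the difficulty-sorted
    data that is eligible for sampling at `step` (grows from c0 to 1)."""
    progress = min(1.0, step / total_steps)
    return min(1.0, math.sqrt(progress * (1.0 - c0 ** 2) + c0 ** 2))

def sample_batch(pairs_sorted, step, total_steps, batch_size, rng, c0=0.1):
    """Sample uniformly (with replacement) from the easiest
    competence(step) fraction of the preference pairs."""
    cutoff = max(1, math.ceil(competence(step, total_steps, c0) * len(pairs_sorted)))
    pool = pairs_sorted[:cutoff]
    return [rng.choice(pool) for _ in range(batch_size)]

def should_refresh_reference(step, total_steps, num_stages):
    """Assumed rule: refresh the reference model at the first step of
    every stage after the first, with equal-length stages."""
    stage_len = total_steps // num_stages
    return step > 0 and stage_len > 0 and step % stage_len == 0

# Toy preference pairs with made-up difficulty scores (hypothetical data).
difficulties = [0.9, 0.1, 0.5, 0.3, 0.7, 0.2, 0.8, 0.4, 0.6, 1.0]
pairs = [{"id": i, "difficulty": d} for i, d in enumerate(difficulties)]
pairs_sorted = sorted(pairs, key=lambda p: p["difficulty"])  # easiest first

rng = random.Random(0)
total_steps, num_stages = 12, 3
refresh_steps = []
for step in range(total_steps):
    if should_refresh_reference(step, total_steps, num_stages):
        # In real training, the current policy weights would be copied
        # into the reference model here.
        refresh_steps.append(step)
    batch = sample_batch(pairs_sorted, step, total_steps, batch_size=4, rng=rng)
    # A DPO (or other preference-loss) update on `batch` would go here.

print("reference refreshed at steps:", refresh_steps)
```

Because the sampler only decides which pairs are eligible at each step, the preference loss applied to each batch can be swapped without touching the curriculum logic, which is in line with the abstract's statement that the framework is agnostic to the policy optimisation loss.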

Related benchmarks

Task	Dataset	Metric	Result	
Over-refusal	XSTest	Overrefusal Rate	0.00	102
General Language Understanding	MMLU	MMLU Accuracy	68.2	29
Safety Alignment	AdvBench	Harm Rate	0.2	25
Safety Alignment Evaluation	SorryBench	Harmful Response Rate (%)	4.2	18
Safety Alignment Evaluation	HEx-PHI	Harmful Response Rate	0.7	18
Jailbreak Robustness	Jailbreak Attacks	Prefill Success Rate	24.2	18
Reward Accuracy	Cleaned-PKU-HH-SafeRLHF (test)	Reward Accuracy	91.3	15
Safety Alignment Evaluation	OOD Safety Suite Average of SorryBench, AdvBench, and HEx-PHI	Average Absolute Improvement	-7.1	12

