Curriculum Learning for Safety Alignment

About

Direct Preference Optimisation (DPO) is widely used for safety alignment in large language models. However, prior work shows it is brittle and exhibits poor out-of-distribution (OOD) generalisation. In this paper, we investigate whether Curriculum Learning can improve the robustness of DPO-based safety alignment. We propose Staged-Competence, a curriculum-based framework that organises preference data by difficulty, employs competence-based sampling, and progressively updates the reference model during training. Averaged across three model families, Staged-Competence reduces OOD harmful response rates by 16% and jailbreak attack success rates by 20%, while preserving general capabilities with near-zero over-refusal. We further show that Staged-Competence (1) matches baseline safety with only 75% of the training data and (2) yields better separation between safe and unsafe responses. Staged-Competence is agnostic to the policy optimisation loss and can extend to other DPO variants and alignment domains. Our code and data are available at https://github.com/Sandeep5500/curriculum-learning-for-safety.
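The three components named in the abstract (difficulty ordering, competence-based sampling, and staged reference-model updates) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the abstract does not specify the difficulty metric, the competence schedule, or the update rule. The square-root schedule is borrowed from the general competence-based curriculum literature, and `difficulty_fn`, `c0`, and `num_stages` are hypothetical names, not taken from the authors' code.

```python
import math
import random


def competence(step, total_steps, c0=0.1):
    """Square-root competence schedule (an assumed choice).

    Grows from c0 at step 0 to 1.0 at total_steps.
    """
    frac = min(1.0, step / total_steps)
    return min(1.0, math.sqrt(frac * (1.0 - c0 ** 2) + c0 ** 2))


def rank_by_difficulty(pairs, difficulty_fn):
    """Sort preference pairs from easiest to hardest.

    difficulty_fn is a hypothetical scoring function supplied by the caller.
    """
    return sorted(pairs, key=difficulty_fn)


def sample_batch(sorted_pairs, step, total_steps, batch_size, rng):
    """Sample uniformly from the easiest competence(step) fraction of the data."""
    c = competence(step, total_steps)
    cutoff = max(1, math.ceil(c * len(sorted_pairs)))
    pool = sorted_pairs[:cutoff]
    return [rng.choice(pool) for _ in range(batch_size)]


def should_refresh_reference(step, total_steps, num_stages):
    """Return True at interior stage boundaries.

    At these steps a training loop would replace the DPO reference model
    with a copy of the current policy (assumed update rule).
    """
    stage_len = max(1, total_steps // num_stages)
    return 0 < step < total_steps and step % stage_len == 0


if __name__ == "__main__":
    rng = random.Random(0)
    # Toy data: each "pair" is just an integer whose value is its difficulty.
    ordered = rank_by_difficulty(list(range(100)), lambda p: p)
    for step in (0, 250, 500, 750):
        batch = sample_batch(ordered, step, 1000, 4, rng)
        refresh = should_refresh_reference(step, 1000, num_stages=4)
        print(step, round(competence(step, 1000), 3), batch, refresh)
```

In a real training loop, each sampled batch would feed the preference loss (DPO or a variant), and the reference refresh would copy the policy weights; both are omitted here to keep the sketch independent of any particular framework.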

Sandeep Kumar, Virginia Smith, Chhavi Yadav • 2026

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Over-refusal | XSTest | Overrefusal Rate | 0.00e+0 | 102 |
| General Language Understanding | MMLU | MMLU Accuracy | 68.2 | 29 |
| Safety Alignment | AdvBench | Harm Rate | 0.2 | 25 |
| Safety Alignment Evaluation | Sorrybench | Harmful Response Rate (%) | 4.2 | 18 |
| Safety Alignment Evaluation | HEX-PHI | Harmful Response Rate | 0.7 | 18 |
| Jailbreak Robustness | Jailbreak Attacks | Prefill Success Rate | 24.2 | 18 |
| Reward Accuracy | Cleaned-PKU-HH-SafeRLHF (test) | Reward Accuracy | 91.3 | 15 |
| Safety Alignment Evaluation | OOD Safety Suite Average of SorryBench, AdvBench, and HEx-PHI | Average Absolute Improvement | -7.1 | 12 |