
Safety Alignment as Continual Learning: Mitigating the Alignment Tax via Orthogonal Gradient Projection

About

Large Language Models (LLMs) often incur an alignment tax: safety post-training can reduce general utility (e.g., reasoning and coding). We argue that this tax primarily arises from continual-learning-style forgetting in sequential alignment, where distribution shift and conflicting objectives cause safety updates to overwrite pre-trained competencies. Accordingly, we cast safety alignment as a continual learning (CL) problem that must balance plasticity (acquiring safety constraints) and stability (preserving general abilities). We propose Orthogonal Gradient Projection for Safety Alignment (OGPSA), a lightweight method that mitigates interference by constraining each safety update to be orthogonal (in a first-order sense) to a learned subspace capturing general capabilities. Specifically, OGPSA estimates a low-rank capability subspace from gradients on a small reference set and projects the safety gradient onto its orthogonal complement before updating. This produces safety-directed updates that minimally perturb prior knowledge while retaining capacity for alignment. OGPSA is plug-and-play and integrates into standard post-training pipelines without large-scale replay, auxiliary objectives, or retraining. Across Supervised Fine-Tuning (SFT), Direct Preference Optimization (DPO), and sequential SFT→DPO settings, OGPSA consistently improves the safety–utility Pareto frontier over standard baselines. For instance, on Qwen2.5-7B-Instruct under SFT→DPO, OGPSA preserves strong safety while recovering general capability, improving SimpleQA from 0.53% to 3.03% and IFEval from 51.94% to 63.96%. Our source code is available at https://github.com/SunGL001/OGPSA.
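The core projection step described above can be sketched in a few lines. The following is a minimal illustration, not the authors' implementation: the function names, the use of SVD over stacked per-example gradients, and the toy dimensions are all assumptions for exposition.

```python
import numpy as np

def capability_subspace(ref_grads, rank):
    """Estimate a low-rank capability subspace from reference-set gradients.

    ref_grads: (n_samples, n_params) matrix of gradients computed on a small
    reference set of general-capability data. Returns an orthonormal basis
    of shape (n_params, rank) spanning the dominant gradient directions.
    """
    # Top right-singular vectors span the directions most used by
    # general capabilities (illustrative choice of subspace estimator).
    _, _, vt = np.linalg.svd(ref_grads, full_matrices=False)
    return vt[:rank].T  # orthonormal columns

def project_orthogonal(safety_grad, basis):
    """Project the safety gradient onto the orthogonal complement of the
    capability subspace: g - U (U^T g)."""
    return safety_grad - basis @ (basis.T @ safety_grad)

# Toy example: two reference gradients in a 4-dimensional parameter space.
rng = np.random.default_rng(0)
ref = rng.normal(size=(2, 4))
U = capability_subspace(ref, rank=2)
g = rng.normal(size=4)
g_safe = project_orthogonal(g, U)

# The projected update carries no first-order component along any
# capability direction.
assert np.allclose(U.T @ g_safe, 0.0, atol=1e-8)
```

In a real post-training loop this projection would be applied to the flattened safety gradient (per layer or globally) before each optimizer step; the paper's actual subspace estimator and update rule may differ.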

Guanglong Sun, Siyuan Zhang, Liyuan Wang, Jun Zhu, Hang Su, Yi Zhong • 2026

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Multi-task Language Understanding | MMLU (test) | Normalized Accuracy | 70.93 | 76 |
| Instruction Following | IFEval (test) | IFEval Score | 63.96 | 45 |
| Safety Evaluation | XSTest (test) | XSTest Score | 95 | 32 |
| Safety Evaluation | WildChat (test) | WildChat Score | 49.4 | 13 |
| Adversarial Robustness | AdvGLUE (test) | -- | -- | 6 |
| Safety Evaluation | Stereotype (test) | Stereotype Score | 100 | 5 |
| Safety Evaluation | StrongReject (test) | -- | -- | 4 |
| Helpfulness Evaluation | HHH (test) | HHH Score | 90.68 | 3 |
| Question Answering | SimpleQA (test) | SimpleQA Score | 3.61 | 3 |
| Expert-Level Question Answering | GPQA (test) | GPQA Score | 34.51 | 2 |
