Safety Alignment as Continual Learning: Mitigating the Alignment Tax via Orthogonal Gradient Projection
About
Large Language Models (LLMs) often incur an alignment tax: safety post-training can reduce general utility (e.g., reasoning and coding). We argue that this tax primarily arises from continual-learning-style forgetting in sequential alignment, where distribution shift and conflicting objectives cause safety updates to overwrite pre-trained competencies. Accordingly, we cast safety alignment as a continual learning (CL) problem that must balance plasticity (acquiring safety constraints) and stability (preserving general abilities). We propose Orthogonal Gradient Projection for Safety Alignment (OGPSA), a lightweight method that mitigates interference by constraining each safety update to be orthogonal (in a first-order sense) to a learned subspace capturing general capabilities. Specifically, OGPSA estimates a low-rank capability subspace from gradients on a small reference set and projects the safety gradient onto its orthogonal complement before updating. This produces safety-directed updates that minimally perturb prior knowledge while retaining capacity for alignment. OGPSA is plug-and-play and integrates into standard post-training pipelines without large-scale replay, auxiliary objectives, or retraining. Across Supervised Fine-Tuning (SFT), Direct Preference Optimization (DPO), and sequential SFT→DPO settings, OGPSA consistently improves the safety-utility Pareto frontier over standard baselines. For instance, on Qwen2.5-7B-Instruct under SFT→DPO, OGPSA preserves strong safety while recovering general capability, improving SimpleQA from 0.53% to 3.03% and IFEval from 51.94% to 63.96%. Our source code is available at https://github.com/SunGL001/OGPSA.
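The projection step described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: it assumes gradients are flattened into plain vectors and uses a truncated SVD of the stacked reference gradients as one plausible way to obtain a low-rank basis, which the abstract does not specify. The function names `capability_basis` and `project_out`, the rank, and the random data are all hypothetical.

```python
import numpy as np


def capability_basis(ref_grads: np.ndarray, rank: int) -> np.ndarray:
    """Orthonormal basis (d x rank) for the dominant directions of the
    reference gradients, given as rows of an (n x d) matrix.

    Assumption: a truncated SVD is used here purely for illustration.
    """
    # Rows of vt are right singular vectors, i.e. directions in parameter space.
    _, _, vt = np.linalg.svd(ref_grads, full_matrices=False)
    return vt[:rank].T


def project_out(grad: np.ndarray, basis: np.ndarray) -> np.ndarray:
    """Project grad onto the orthogonal complement of span(basis).

    Requires basis to have orthonormal columns.
    """
    return grad - basis @ (basis.T @ grad)


# Toy data standing in for real gradients (hypothetical sizes).
rng = np.random.default_rng(0)
d = 16
ref_grads = rng.normal(size=(4, d))   # gradients on a small reference set
safety_grad = rng.normal(size=d)      # gradient of the safety objective

U = capability_basis(ref_grads, rank=2)
g_proj = project_out(safety_grad, U)

# The projected update has no component along the capability subspace.
print(np.abs(U.T @ g_proj).max())
```

To first order, a step along `g_proj` leaves unchanged any loss whose gradient lies in the span of `U`, which is the sense in which the update avoids interfering with general capabilities. Because projection can only shrink a vector, the projected gradient is never longer than the original.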
Related benchmarks
| Task | Dataset | Metric | Score | Rank |
|---|---|---|---|---|
| Multi-task Language Understanding | MMLU (test) | Normalized Accuracy | 70.93 | 76 |
| Instruction Following | IFEval (test) | IFEval Score | 63.96 | 45 |
| Safety Evaluation | XSTest (test) | XSTest Score | 95 | 32 |
| Safety Evaluation | WildChat (test) | WildChat Score | 49.4 | 13 |
| Adversarial Robustness | AdvGLUE (test) | -- | -- | 6 |
| Safety Evaluation | Stereotype (test) | Stereotype Score | 100 | 5 |
| Safety Evaluation | StrongReject (test) | -- | -- | 4 |
| Helpfulness Evaluation | HHH (test) | HHH Score | 90.68 | 3 |
| Question Answering | SimpleQA (test) | SimpleQA Score | 3.61 | 3 |
| Expert-Level Question Answering | GPQA (test) | GPQA Score | 34.51 | 2 |