Safety Alignment as Continual Learning: Mitigating the Alignment Tax via Orthogonal Gradient Projection

About

Safety post-training can reduce the harmfulness and improve the policy compliance of Large Language Models (LLMs), but it may also degrade general utility, a phenomenon often described as the \emph{alignment tax}. We study this trade-off through the lens of continual learning: sequential alignment stages expose the model to shifted data distributions and objectives, and their gradients may interfere with directions that support previously acquired general capabilities. This view does not claim that all alignment degradation has a single cause; rather, it provides a useful first-order mechanism for mitigating one important source of capability regression. We propose \textbf{O}rthogonal \textbf{G}radient \textbf{P}rojection for \textbf{S}afety \textbf{A}lignment (\textbf{OGPSA}), a lightweight update rule that estimates a low-rank reference subspace from gradients on a small set of general-capability data and removes from each safety gradient the component lying in this subspace. The resulting update is the steepest local safety-descent direction subject to first-order preservation constraints on the reference objectives. OGPSA is compatible with standard post-training pipelines and avoids large-scale replay, although it introduces periodic reference-gradient computation. Across Supervised Fine-Tuning (SFT), Direct Preference Optimization (DPO), and sequential SFT$\rightarrow$DPO settings, OGPSA improves the observed safety--utility trade-off over standard baselines. Under the sequential SFT$\rightarrow$DPO pipeline, the average performance gain increases from 33.98\% to 42.74\% on Qwen2.5-7B-Instruct and from 19.74\% to 32.98\% on Llama3.1-8B-Instruct. Our code is open-sourced at https://github.com/SunGL001/OGPSA.
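
The projection step described in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names `reference_basis` and `project_out`, the use of a truncated SVD to estimate the low-rank subspace, and the toy dimensions are all assumptions made for illustration. Gradients are treated as flat vectors, whereas a real pipeline would operate on model parameters.

```python
import numpy as np


def reference_basis(ref_grads: np.ndarray, rank: int) -> np.ndarray:
    """Estimate an orthonormal basis for a low-rank reference subspace.

    ref_grads has shape (n_samples, d): one flattened gradient per
    general-capability batch. Truncated SVD is an assumed choice here;
    the abstract does not say how the subspace is estimated.
    """
    # Rows of vt are orthonormal right singular vectors, ordered by
    # decreasing singular value.
    _, _, vt = np.linalg.svd(ref_grads, full_matrices=False)
    return vt[:rank].T  # shape (d, rank)


def project_out(safety_grad: np.ndarray, basis: np.ndarray) -> np.ndarray:
    """Remove from safety_grad its component lying in span(basis)."""
    return safety_grad - basis @ (basis.T @ safety_grad)


# Toy data (hypothetical): 4 reference gradients in an 8-dimensional space.
rng = np.random.default_rng(0)
ref_grads = rng.normal(size=(4, 8))
safety_grad = rng.normal(size=8)

basis = reference_basis(ref_grads, rank=2)
projected = project_out(safety_grad, basis)

# The projected gradient has no component along the retained directions.
residual = basis.T @ projected
```

In this sketch the projected gradient is exactly orthogonal only to the retained directions, so a first-order preservation guarantee holds for those directions and only approximately for reference gradients outside the truncated subspace. Because the projection is orthogonal, the inner product of `projected` with `safety_grad` equals the squared norm of `projected`, so stepping along the negative projected gradient does not increase the safety loss to first order.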

Guanglong Sun, Siyuan Zhang, Liyuan Wang, Jun Zhu, Hang Su, Yi Zhong • 2026

Related benchmarks

Task	Dataset	Result
Instruction Following	IFEval (test)	IFEval Score 63.96	88
Multi-task Language Understanding	MMLU (test)	Normalized Accuracy 70.93	87
Safety Evaluation	XSTest (test)	XSTest Score 95	36
Safety Evaluation	WildChat (test)	WildChat Score 49.4	13
Adversarial Robustness	AdvGLUE (test)	--	6
Safety Evaluation	Stereotype (test)	Stereotype Score 100	5
Safety Evaluation	StrongReject (test)	--	4
Helpfulness Evaluation	HHH (test)	HHH Score 90.68	3
Question Answering	SimpleQA (test)	SimpleQA Score 3.61	3
Expert-Level Question Answering	GPQA (test)	GPQA Score 34.51	2
