SafeAnchor: Preventing Cumulative Safety Erosion in Continual Domain Adaptation of Large Language Models

About

Safety alignment in large language models is remarkably shallow: it is concentrated in the first few output tokens and reversible by fine-tuning on as few as 100 adversarial examples. This fragility becomes critical in real-world deployment, where models undergo sequential adaptation across domains such as medicine, law, and code, causing safety guardrails to erode cumulatively. Yet all existing safety-preserving methods target only single-task fine-tuning, leaving the multi-domain sequential setting entirely unaddressed. We introduce SafeAnchor, a framework that anchors safety in place throughout continual adaptation. SafeAnchor first identifies low-rank safety subspaces in LoRA parameter space via Fisher Information eigendecomposition, then constrains domain-specific gradient updates to the orthogonal complement of these subspaces, and finally monitors for residual safety drift with threshold-triggered corrective replay. Evaluated on Llama-2-7B-Chat and Mistral-7B-Instruct across a three-domain pipeline and eight benchmarks, SafeAnchor retains 93.2% of original safety alignment, outperforming all baselines by 18-42 points, while matching unconstrained fine-tuning to within 1.5 points on domain tasks.

Dongxin Guo, Jikun Wu, Siu Ming Yiu • 2026
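The three stages named in the abstract (Fisher-based subspace identification, orthogonal gradient projection, threshold-triggered replay) can be illustrated with a small NumPy sketch. This is not the authors' implementation. It assumes a flat parameter vector in place of LoRA matrices, an empirical Fisher matrix built from per-example gradients on safety data, and a scalar safety score for drift monitoring. The function names `safety_subspace`, `project_out`, and `needs_replay`, and all numeric values, are hypothetical.

```python
import numpy as np


def safety_subspace(per_example_grads: np.ndarray, rank: int) -> np.ndarray:
    """Return an orthonormal basis (d x rank) for the top eigenvectors of the
    empirical Fisher matrix F = G^T G / n (assumed estimator, not from the paper)."""
    n = per_example_grads.shape[0]
    fisher = per_example_grads.T @ per_example_grads / n
    # eigh returns eigenvalues in ascending order, so the last columns
    # are the directions with the largest Fisher eigenvalues.
    _, eigvecs = np.linalg.eigh(fisher)
    return eigvecs[:, -rank:]


def project_out(grad: np.ndarray, basis: np.ndarray) -> np.ndarray:
    """Project grad onto the orthogonal complement of span(basis)."""
    return grad - basis @ (basis.T @ grad)


def needs_replay(baseline_score: float, current_score: float, threshold: float) -> bool:
    """Signal corrective replay when the safety score drops by more than threshold."""
    return (baseline_score - current_score) > threshold


# Toy data: safety gradients live only in the first two of six coordinates.
rng = np.random.default_rng(0)
safety_grads = rng.normal(size=(50, 2)) @ np.eye(6)[:2]
basis = safety_subspace(safety_grads, rank=2)

# A domain-task gradient is stripped of its component in the safety subspace.
domain_grad = rng.normal(size=6)
safe_grad = project_out(domain_grad, basis)
```

In this toy setup `safe_grad` has no component along the two safety directions, while its remaining coordinates equal those of `domain_grad`. The choice of rank and drift threshold is left open here because the abstract does not state them.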

Related benchmarks

Task	Dataset	Result
Safety Evaluation	HarmBench	--	153
Massive Multitask Language Understanding	MMLU	Accuracy 55.9	137
Malicious Prompt Refusal	HarmBench	Refusal Rate 78.4	38
Safety Evaluation	HarmBench, TruthfulQA, and BBQ	Safety Score 85.2	16
Massive Multitask Language Understanding	MMLU	MMLU 45.7	16
Domain Adaptation Utility	Specialized Domains Medical, Legal, Code, Finance, Science	Composite Domain Score 61.4	10
Sequential Domain Adaptation	Medical, Legal, and Code Sequential	Domain Score 61.4	9
Domain Adaptation	Medical, Legal, and Code	Domain Score 63.8	7

