SafeAnchor: Preventing Cumulative Safety Erosion in Continual Domain Adaptation of Large Language Models
About
Safety alignment in large language models is remarkably shallow: it is concentrated in the first few output tokens and reversible by fine-tuning on as few as 100 adversarial examples. This fragility becomes critical in real-world deployment, where models undergo sequential adaptation across domains such as medicine, law, and code, causing safety guardrails to erode cumulatively. Yet all existing safety-preserving methods target only single-task fine-tuning, leaving the multi-domain sequential setting entirely unaddressed. We introduce SafeAnchor, a framework that anchors safety in place throughout continual adaptation. SafeAnchor first identifies low-rank safety subspaces in LoRA parameter space via Fisher Information eigendecomposition, then constrains domain-specific gradient updates to the orthogonal complement of these subspaces, and finally monitors for residual safety drift with threshold-triggered corrective replay. Evaluated on Llama-2-7B-Chat and Mistral-7B-Instruct across a three-domain pipeline and eight benchmarks, SafeAnchor retains 93.2% of original safety alignment, outperforming all baselines by 18-42 points, while matching unconstrained fine-tuning to within 1.5 points on domain tasks.
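The three steps named in the abstract (subspace identification, orthogonal projection, threshold-triggered replay) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. It assumes the LoRA parameters are flattened into a single vector, approximates the Fisher Information with the empirical outer product of per-example safety gradients, and uses hypothetical names throughout (`safety_subspace`, `project_out`, `needs_replay`, `rank`, `tolerance`).

```python
import numpy as np


def safety_subspace(safety_grads: np.ndarray, rank: int) -> np.ndarray:
    """Return an orthonormal basis (d, rank) for the top-`rank` eigenvectors
    of the empirical Fisher F = (1/n) * sum_i g_i g_i^T.

    `safety_grads` has shape (n, d): one flattened gradient per safety example.
    Using the empirical Fisher is an assumption of this sketch.
    """
    fisher = safety_grads.T @ safety_grads / safety_grads.shape[0]
    # eigh handles symmetric matrices and returns eigenvalues in ascending order.
    _, eigvecs = np.linalg.eigh(fisher)
    return eigvecs[:, -rank:]


def project_out(grad: np.ndarray, basis: np.ndarray) -> np.ndarray:
    """Project a domain gradient onto the orthogonal complement of the basis."""
    return grad - basis @ (basis.T @ grad)


def needs_replay(current_safety: float, baseline_safety: float,
                 tolerance: float) -> bool:
    """Hypothetical drift check: trigger corrective replay when the safety
    score falls more than `tolerance` below the baseline."""
    return (baseline_safety - current_safety) > tolerance


# Toy demonstration with synthetic gradients (illustrative values only).
rng = np.random.default_rng(0)
d, n, rank = 8, 32, 2
safety_grads = rng.normal(size=(n, d))
basis = safety_subspace(safety_grads, rank)

domain_grad = rng.normal(size=d)
constrained_grad = project_out(domain_grad, basis)

# The constrained update has no component along the safety directions.
residual = np.abs(basis.T @ constrained_grad).max()
```

In a real training loop the projection would be applied to each domain gradient before the optimizer step, and the drift check would run periodically on a held-out safety set; how the paper chooses the rank and the threshold is not stated in the abstract.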
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Safety Evaluation | HarmBench | -- | 148 |
| Massive Multitask Language Understanding | MMLU | Accuracy 55.9 | 129 |
| Malicious Prompt Refusal | HarmBench | Refusal Rate 78.4 | 23 |
| Safety Evaluation | HarmBench, TruthfulQA, and BBQ | Safety Score 85.2 | 16 |
| Massive Multitask Language Understanding | MMLU | MMLU 45.7 | 16 |
| Domain Adaptation Utility | Specialized Domains (Medical, Legal, Code, Finance, Science) | Composite Domain Score 61.4 | 10 |
| Sequential Domain Adaptation | Medical, Legal, and Code Sequential | Domain Score 61.4 | 9 |
| Domain Adaptation | Medical, Legal, and Code | Domain Score 63.8 | 7 |