Safety Anchor: Defending Harmful Fine-tuning via Geometric Bottlenecks

About

The safety alignment of Large Language Models (LLMs) remains vulnerable to Harmful Fine-tuning (HFT). While existing defenses impose constraints on parameters, gradients, or internal representations, we observe that they can be effectively circumvented under persistent HFT. Our analysis traces this failure to the inherent redundancy of the high-dimensional parameter space: attackers exploit optimization trajectories that are orthogonal to defense constraints to restore harmful capabilities while deceptively adhering to safety restrictions. To address this, we propose Safety Bottleneck Regularization (SBR). SBR shifts the defensive focus from the redundant parameter space to the unembedding layer, which serves as a geometric bottleneck. By anchoring the final hidden states of harmful queries to those of the safety-aligned model, SBR enables the model to maintain safe responses even under persistent HFT. Extensive experiments confirm SBR's effectiveness, demonstrating that utilizing just a single safety anchor is sufficient to reduce the Harmful Score to below 10 while preserving competitive performance on benign downstream tasks.

Guoxin Lu, Letian Sha, Qing Wang, Peijie Sun, Hao Zhou, Hua Dai, Fu Xiao • 2026
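
The anchoring idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the mean-squared-distance form of the penalty, the weight `lam`, and the function names `anchor_loss`, `unembed`, and `total_loss` are all assumptions made for illustration. The sketch shows a penalty that pulls the fine-tuned model's final hidden state for a harmful query toward the state produced by the safety-aligned model. It also shows why the unembedding layer acts as a bottleneck: if the unembedding matrix is unchanged (an assumption here), identical final hidden states give identical logits, whatever happened to the upstream parameters.

```python
from typing import List, Sequence

Vector = Sequence[float]


def anchor_loss(h_tuned: Vector, h_ref: Vector) -> float:
    """Mean squared distance between two final hidden states.

    h_tuned: final hidden state of the model being fine-tuned (harmful query).
    h_ref:   final hidden state of the safety-aligned model (same query).
    The squared-distance form is an assumption; the abstract only says the
    states are "anchored".
    """
    if len(h_tuned) != len(h_ref):
        raise ValueError("hidden states must have the same dimension")
    return sum((a - b) ** 2 for a, b in zip(h_tuned, h_ref)) / len(h_ref)


def unembed(W_U: Sequence[Vector], h: Vector) -> List[float]:
    """Compute logits = W_U @ h, with W_U of shape (vocab_size, hidden_dim)."""
    return [sum(w * x for w, x in zip(row, h)) for row in W_U]


def total_loss(task_loss: float, h_tuned: Vector, h_ref: Vector,
               lam: float = 1.0) -> float:
    """Fine-tuning loss plus the weighted anchor penalty (lam is hypothetical)."""
    return task_loss + lam * anchor_loss(h_tuned, h_ref)


# Toy example: a 3-dimensional hidden state and a 2-token vocabulary.
h_ref = [1.0, 0.0, -1.0]      # state from the safety-aligned model
h_drift = [1.0, 2.0, -1.0]    # state after (hypothetical) harmful drift
W_U = [[1.0, 0.0, 0.0],
       [0.0, 1.0, 1.0]]

# No drift means no penalty, and the logits match the aligned model's.
assert anchor_loss(h_ref, h_ref) == 0.0
assert unembed(W_U, h_ref) == unembed(W_U, list(h_ref))

# Drift in the final hidden state is penalized in proportion to lam.
print(total_loss(task_loss=0.5, h_tuned=h_drift, h_ref=h_ref, lam=2.0))
```

In a real training loop the hidden states would come from the last layer of each model on a batch of harmful queries, and the penalty would be added to the fine-tuning objective at every step; how the paper weights and schedules that term is not stated in the abstract.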

Related benchmarks

Task	Dataset	Result
Sentiment Classification	SST2 (test)	--	233
Instruction Following	AlpacaEval (test)	Helpfulness Score 6.2	65
News Classification	AG News (test)	Accuracy 90.1	48
Defense against Harmful Fine-tuning	Backdoor Jailbreaking No Trigger	Harm Score 1.8	6
Utility Preservation	General Utility Evaluation	FA Score 92.89	6
Defense against Harmful Fine-tuning	Backdoor Jailbreaking With Trigger	HS Score 4.7	6
Mathematical Reasoning	GSM8K (test)	Hit Score (HS) 5.6	6
