AsFT: Anchoring Safety During LLM Fine-Tuning Within Narrow Safety Basin

About

Fine-tuning large language models (LLMs) improves task performance but introduces critical safety vulnerabilities: even a small amount of harmful data can severely compromise safety alignment. We observe that perturbations orthogonal to the alignment direction (defined by the weight difference between an aligned, safe model and its unaligned counterpart) rapidly degrade model safety, whereas updates along the alignment direction largely preserve it. This reveals the parameter space around an aligned model as a "narrow safety basin". Building on this observation, we propose AsFT (Anchoring Safety in Fine-Tuning), which maintains safety by explicitly constraining update directions during fine-tuning. By penalizing the components of updates orthogonal to the alignment direction, AsFT keeps the model within the narrow safety basin and thereby preserves its inherent safety. Extensive experiments on multiple datasets and models show that AsFT reduces harmful behavior by up to 7.60%, improves task performance by 3.44%, and consistently outperforms existing methods across multiple tasks.
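Read literally, the abstract suggests a simple recipe: compute a per-parameter alignment direction from the weight difference between the aligned and unaligned checkpoints, then regularize fine-tuning so that the component of the cumulative weight update orthogonal to that direction is penalized. Below is a minimal PyTorch sketch of that reading; the helper names, the unit-norm projection, the Hugging Face-style `model(**batch).loss` interface, and the penalty weight `lam` are all illustrative assumptions, not the authors' released implementation.

```python
import torch

def alignment_directions(aligned_model, unaligned_model):
    """Unit-norm alignment direction per parameter tensor, taken as the
    weight difference between the aligned (safe) and unaligned checkpoints.
    Assumes both models share the same architecture and parameter order."""
    dirs = {}
    for (name, w_a), (_, w_u) in zip(aligned_model.named_parameters(),
                                     unaligned_model.named_parameters()):
        d = (w_a - w_u).detach().flatten()
        dirs[name] = d / (d.norm() + 1e-8)
    return dirs

def orthogonal_penalty(model, ref_params, dirs):
    """Squared norm of the part of the cumulative fine-tuning update that is
    orthogonal to the alignment direction (the part that exits the basin)."""
    penalty = torch.zeros((), device=next(model.parameters()).device)
    for name, w in model.named_parameters():
        delta = (w - ref_params[name]).flatten()       # update so far
        parallel = (delta @ dirs[name]) * dirs[name]   # projection onto d
        penalty = penalty + (delta - parallel).pow(2).sum()
    return penalty

def training_step(model, batch, ref_params, dirs, optimizer, lam=1.0):
    """One fine-tuning step: task loss plus lam * orthogonal penalty.
    `lam` is a hypothetical hyperparameter controlling the basin constraint."""
    out = model(**batch)  # assumes an HF-style model that returns .loss
    loss = out.loss + lam * orthogonal_penalty(model, ref_params, dirs)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Assumed setup before training:
#   ref_params = {n: p.detach().clone() for n, p in model.named_parameters()}
#   dirs = alignment_directions(aligned_model, unaligned_model)
```

In this sketch the penalty leaves motion along the alignment direction unconstrained while discouraging motion orthogonal to it, which matches the paper's "narrow safety basin" picture of the loss landscape.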

Shuo Yang, Qihui Zhang, Yuyang Liu, Xiaojun Jia, Kunpeng Ning, Jiayu Yao, Jigang Wang, Hailiang Dai, Yibing Song, Li Yuan • 2025

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Instruction Following | AlpacaEval | Win Rate | 42.62 | 227 |
| Safety Evaluation | HarmBench | HarmBench Score | 22.5 | 112 |
| Mathematical Reasoning | GSM8K (test) | Harmful Score (HS) | 17.2 | 62 |
| Topic Classification | AGNews | FA Score | 0.9273 | 48 |
| Mathematical Reasoning | GSM8K (test) | Finetune Accuracy | 68.7 | 40 |
| Safety Evaluation | Harmful Prompts | Harmful Score | 16.1 | 40 |
| Harmful Score Evaluation | BeaverTails (test) | Harmful Score | 16.1 | 36 |
| Mathematical Reasoning | GSM8K | Fine-tune Accuracy (FA) | 83.31 | 28 |
| Medical Question Answering | PubMedQA | Factual Accuracy (FA) | 95.18 | 28 |
| Sentiment Analysis | SST-2 | Fine-tune Accuracy (FA) | 88.93 | 28 |
Showing 10 of 16 rows
