
Lisa: Lazy Safety Alignment for Large Language Models against Harmful Fine-tuning Attack

About

Recent studies show that Large Language Models (LLMs) with safety alignment can be jail-broken by fine-tuning on a dataset mixed with harmful data. For the first time in the literature, we show that the jail-broken effect can be mitigated by separating states in the fine-tuning stage to optimize over the alignment and user datasets. Unfortunately, our subsequent study shows that this simple Bi-State Optimization (BSO) solution experiences convergence instability when the number of steps invested in its alignment state is too small, leading to downgraded alignment performance. Through statistical analysis, we show that excess drift towards consensus could be a probable reason for the instability. To remedy this issue, we propose Lazy(i) safety alignment (Lisa), which introduces a proximal term to constrain the drift of each state. Theoretically, the benefit of the proximal term is supported by our convergence analysis, wherein we show that a sufficiently large proximal factor is necessary to guarantee Lisa's convergence. Empirically, our results on four downstream fine-tuning tasks show that Lisa with a proximal term can significantly increase alignment performance while maintaining the LLM's accuracy on the user tasks. Code is available at https://github.com/git-disl/Lisa.

Tiansheng Huang, Sihao Hu, Fatih Ilhan, Selim Furkan Tekin, Ling Liu• 2024
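The abstract describes alternating between an alignment state and a user-task state, with a proximal term anchoring each state to the parameters at the last switch. The toy sketch below illustrates that mechanism on scalar quadratic losses; the losses, step counts, and the name `rho` for the proximal factor are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def grad_align(w):
    # Toy "alignment" loss ||w||^2, with optimum at w = 0.
    return 2 * w

def grad_user(w):
    # Toy "user task" loss ||w - 1||^2, with optimum at w = 1.
    return 2 * (w - 1.0)

def lisa(rho=1.0, rounds=50, align_steps=1, user_steps=10, lr=0.05):
    """Bi-state gradient descent with a proximal penalty
    rho/2 * ||w - anchor||^2, where the anchor is reset at each
    state switch to constrain drift between the two states."""
    w = np.zeros(1)
    for _ in range(rounds):
        anchor = w.copy()  # switch to alignment state
        for _ in range(align_steps):
            w = w - lr * (grad_align(w) + rho * (w - anchor))
        anchor = w.copy()  # switch to user-task state
        for _ in range(user_steps):
            w = w - lr * (grad_user(w) + rho * (w - anchor))
    return float(w[0])

# With far fewer alignment steps than user steps, rho = 0 (plain BSO)
# lets the iterate drift toward the user optimum; a larger rho damps
# that drift, keeping the solution closer to the alignment optimum.
drift_small_rho = lisa(rho=0.0)
drift_large_rho = lisa(rho=5.0)
```

The imbalance (1 alignment step vs. 10 user steps) mimics the regime the abstract flags as unstable; increasing `rho` pulls the final iterate back toward the alignment optimum at 0.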

Related benchmarks

Task | Dataset | Result | Rank
Safety Evaluation | HEX-PHI | - | 148
Safety Evaluation | HarmBench | Harmbench Score: 20.75 | 76
Mathematical Reasoning | GSM8K (test) | HS: 48 | 62
Mathematical Reasoning | GSM8K (test) | Finetune Accuracy: 69.5 | 40
Safety Evaluation | Harmful Prompts | Harmful Score: 13.4 | 40
Harmful score evaluation | BeaverTails (test) | Harmful Score: 14.8 | 36
Text Classification | SST-2 | Harmful Score: 50.7 | 35
Instruction Following | AlpacaEval (test) | Helpfulness Score: 37.4 | 32
Safety defense against harmful fine-tuning attacks | Alpaca harmful subset (test) | Harmful Score: 34.9 | 21
Sentiment Analysis | SST2 | Attack Success Rate (ASR): 6.6 | 17
Showing 10 of 17 rows

Other info

Code: https://github.com/git-disl/Lisa
