Lisa: Lazy Safety Alignment for Large Language Models against Harmful Fine-tuning Attack
About
Recent studies show that Large Language Models (LLMs) with safety alignment can be jail-broken by fine-tuning on a dataset mixed with harmful data. For the first time in the literature, we show that the jail-broken effect can be mitigated by separating the fine-tuning stage into two states that optimize, respectively, over the alignment dataset and the user dataset. Unfortunately, our subsequent study shows that this simple Bi-State Optimization (BSO) solution suffers convergence instability when the number of steps invested in its alignment state is too small, leading to degraded alignment performance. Through statistical analysis, we show that the \textit{excess drift} towards consensus could be a probable cause of the instability. To remedy this issue, we propose \textbf{L}azy(\textbf{i}) \textbf{s}afety \textbf{a}lignment (\textbf{Lisa}), which introduces a proximal term to constrain the drift of each state. Theoretically, the benefit of the proximal term is supported by our convergence analysis, wherein we show that a sufficiently large proximal factor is necessary to guarantee Lisa's convergence. Empirically, our results on four downstream fine-tuning tasks show that Lisa with a proximal term can significantly increase alignment performance while maintaining the LLM's accuracy on the user tasks. Code is available at \url{https://github.com/git-disl/Lisa}.
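The abstract above can be sketched in miniature. The snippet below is a hedged, illustrative toy (not the paper's implementation): it alternates between an "alignment" state and a "user fine-tuning" state, and each state adds a proximal penalty `(rho/2)*||w - anchor||^2` that discourages drift away from the iterate at the start of that state. The quadratic losses, step sizes, and step counts are assumptions chosen only to make the drift-control effect visible.

```python
import numpy as np

def grad_quadratic(w, target):
    # Gradient of the toy loss 0.5 * ||w - target||^2.
    return w - target

def lisa_sketch(w0, align_target, user_target,
                rho=1.0, lr=0.1, rounds=1, align_steps=1, user_steps=5):
    """Toy Bi-State Optimization with a Lisa-style proximal term.

    In each round we first take `align_steps` gradient steps on the
    alignment loss, then `user_steps` steps on the user loss.  Each
    state's gradient includes rho * (w - anchor), where `anchor` is the
    iterate at the start of that state, so a larger rho limits how far
    a single state can drag the weights (illustrative, not the paper's code).
    """
    w = np.asarray(w0, dtype=float)
    for _ in range(rounds):
        anchor = w.copy()                      # alignment state
        for _ in range(align_steps):
            g = grad_quadratic(w, align_target) + rho * (w - anchor)
            w = w - lr * g
        anchor = w.copy()                      # user fine-tuning state
        for _ in range(user_steps):
            g = grad_quadratic(w, user_target) + rho * (w - anchor)
            w = w - lr * g
    return w
```

With `rho=0` this reduces to plain BSO, and the many user-state steps pull the weights toward the user objective; with a larger `rho`, the per-state drift is damped, mirroring the role the proximal factor plays in the convergence analysis.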
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Safety Evaluation | HEX-PHI | -- | -- | 148 |
| Safety Evaluation | HarmBench | HarmBench Score | 20.75 | 76 |
| Mathematical Reasoning | GSM8K (test) | HS | 48 | 62 |
| Mathematical Reasoning | GSM8K (test) | Finetune Accuracy | 69.5 | 40 |
| Safety Evaluation | Harmful Prompts | Harmful Score | 13.4 | 40 |
| Harmful score evaluation | BeaverTails (test) | Harmful Score | 14.8 | 36 |
| Text Classification | SST-2 | Harmful Score | 50.7 | 35 |
| Instruction Following | AlpacaEval (test) | Helpfulness Score | 37.4 | 32 |
| Safety defense against harmful fine-tuning attacks | Alpaca harmful subset (test) | Harmful Score | 34.9 | 21 |
| Sentiment Analysis | SST2 | Attack Success Rate (ASR) | 6.6 | 17 |