Vaccine: Perturbation-aware Alignment for Large Language Models against Harmful Fine-tuning Attack

About

The new paradigm of finetuning-as-a-service introduces a new attack surface for Large Language Models (LLMs): a few harmful data uploaded by users can easily trick the finetuning to produce an alignment-broken model. We conduct an empirical analysis and uncover a \textit{harmful embedding drift} phenomenon, showing a probable cause of the alignment-broken effect. Inspired by our findings, we propose Vaccine, a perturbation-aware alignment technique to mitigate the security risk of users finetuning. The core idea of Vaccine is to produce invariant hidden embeddings by progressively adding crafted perturbation to them in the alignment phase. This enables the embeddings to withstand harmful perturbation from un-sanitized user data in the finetuning phase. Our results on open source mainstream LLMs (e.g., Llama2, Opt, Vicuna) demonstrate that Vaccine can boost the robustness of alignment against harmful prompts induced embedding drift while reserving reasoning ability towards benign prompts. Our code is available at \url{https://github.com/git-disl/Vaccine}.

Tiansheng Huang, Sihao Hu, Ling Liu• 2024

Related benchmarks

Task	Dataset	Result
Instruction Following	AlpacaEval	--	420
Sentiment Classification	SST2 (test)	--	233
Instruction Following	Alpaca	--	173
Safety Evaluation	HEX-PHI	--	162
Sentiment Analysis	SST-2 (test)	Accuracy95	144
Instruction Following	AlpacaEval (test)	Helpfulness Score62.9	65
Harmful question-answering	BeaverTails HarmfulQA (1k and 10k samples)	Avg Harmfulness Score0.05	63
Mathematical Reasoning	GSM8K (test)	HS51.4	62
Topic Classification	AGNews	FA Score0.892	58
Harmful score evaluation	BeaverTails (test)	Harmful Score0.707	52

Showing 10 of 45 rows

Other info

Code

Follow for update

@wizwand_team Discord