Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To!

About

Optimizing large language models (LLMs) for downstream use cases often involves the customization of pre-trained LLMs through further fine-tuning. Meta's open release of Llama models and OpenAI's APIs for fine-tuning GPT-3.5 Turbo on custom datasets also encourage this practice. But, what are the safety costs associated with such custom fine-tuning? We note that while existing safety alignment infrastructures can restrict harmful behaviors of LLMs at inference time, they do not cover safety risks when fine-tuning privileges are extended to end-users. Our red teaming studies find that the safety alignment of LLMs can be compromised by fine-tuning with only a few adversarially designed training examples. For instance, we jailbreak GPT-3.5 Turbo's safety guardrails by fine-tuning it on only 10 such examples at a cost of less than $0.20 via OpenAI's APIs, making the model responsive to nearly any harmful instructions. Disconcertingly, our research also reveals that, even without malicious intent, simply fine-tuning with benign and commonly used datasets can also inadvertently degrade the safety alignment of LLMs, though to a lesser extent. These findings suggest that fine-tuning aligned LLMs introduces new safety risks that current safety infrastructures fall short of addressing -- even if a model's initial safety alignment is impeccable, it is not necessarily to be maintained after custom fine-tuning. We outline and critically analyze potential mitigations and advocate for further research efforts toward reinforcing safety protocols for the custom fine-tuning of aligned LLMs.

Xiangyu Qi, Yi Zeng, Tinghao Xie, Pin-Yu Chen, Ruoxi Jia, Prateek Mittal, Peter Henderson• 2023

Related benchmarks

Task	Dataset	Result
Mathematical Reasoning	GSM8K	Accuracy70.14	1398
Visual Mathematical Reasoning	MathVista	Accuracy59.57	366
Jailbreak Attack	AdvBench	AASR50.52	271
Safety Evaluation	HEX-PHI	HEx-PHI Score1	162
Question Answering	OpenBookQA	Accuracy43.6	145
Safety Evaluation	AdvBench	Safety Score100	117
Safety Evaluation	SORRY-Bench	Safety Score98.41	90
Malicious Training Sample Detection	Attack Training Samples (test)	Detectability100	30
Safety Evaluation	Sorry-Bench base	Safety Score87.73	27
Backdoor Defense	Code Injection (test)	ASR31.47	22

Showing 10 of 20 rows

Other info

Follow for update

@wizwand_team Discord