Share your thoughts, 1 month free Claude Pro on usSee more
WorkDL logo mark

Keeping LLMs Aligned After Fine-tuning: The Crucial Role of Prompt Templates

About

Public LLMs such as the Llama 2-Chat underwent alignment training and were considered safe. Recently Qi et al. [2024] reported that even benign fine-tuning on seemingly safe datasets can give rise to unsafe behaviors in the models. The current paper is about methods and best practices to mitigate such loss of alignment. We focus on the setting where a public model is fine-tuned before serving users for specific usage, where the model should improve on the downstream task while maintaining alignment. Through extensive experiments on several chat models (Meta's Llama 2-Chat, Mistral AI's Mistral 7B Instruct v0.2, and OpenAI's GPT-3.5 Turbo), this paper uncovers that the prompt templates used during fine-tuning and inference play a crucial role in preserving safety alignment, and proposes the ``Pure Tuning, Safe Testing'' (PTST) strategy -- fine-tune models without a safety prompt, but include it at test time. This seemingly counterintuitive strategy incorporates an intended distribution shift to encourage alignment preservation. Fine-tuning experiments on GSM8K, ChatDoctor, and OpenOrca show that PTST significantly reduces the rise of unsafe behaviors.

Kaifeng Lyu, Haoyu Zhao, Xinran Gu, Dingli Yu, Anirudh Goyal, Sanjeev Arora• 2024

Related benchmarks

TaskDatasetResultRank
Question AnsweringOpenBookQA
Accuracy83.25
305
Safety EvaluationBeavertails
ASR37.2
19
Safety EvaluationLatHarmful
ASR10.71
14
Safety EvaluationQ-LatHarmful
Attack Success Rate (ASR)16.9
14
Safety EvaluationI-BeaverTails
Attack Success Rate (ASR)56.39
14
Safety EvaluationHarmful Fine-tuning Attacks Average
ASR79.48
7
Safety DefenseBeavertails
ASR62.4
7
Safety DefenseQ-LatHarmful
ASR83.5
7
Safety DefenseI-BeaverTails
ASR70.23
7
Safety DefenseLatHarmful
ASR82.42
7
Showing 10 of 10 rows

Other info

Code

Follow for update