
Safety-Tuned LLaMAs: Lessons From Improving the Safety of Large Language Models that Follow Instructions

About

Training large language models to follow instructions makes them perform better on a wide range of tasks and generally become more helpful. However, a perfectly helpful model will follow even the most malicious instructions and readily generate harmful content. In this paper, we raise concerns over the safety of models that only emphasize helpfulness, not harmlessness, in their instruction-tuning. We show that several popular instruction-tuned models are highly unsafe. Moreover, we show that adding just 3% safety examples (a few hundred demonstrations) when fine-tuning a model like LLaMA can substantially improve its safety. Our safety-tuning does not make models significantly less capable or helpful as measured by standard benchmarks. However, we do find exaggerated safety behaviours, where too much safety-tuning makes models refuse perfectly safe prompts if they superficially resemble unsafe ones. As a whole, our results illustrate trade-offs in training LLMs to be helpful and training them to be safe.
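As a rough illustration of what "adding just 3% safety examples" could look like in a data pipeline, the sketch below mixes a small random sample of safety demonstrations into an instruction-tuning set. It is not the authors' code. The function name `mix_safety_examples`, the record fields, and the reading of "3%" as a fraction of the base instruction set are all assumptions made for this example.

```python
import random


def mix_safety_examples(instruction_data, safety_data, fraction=0.03, seed=0):
    """Return instruction_data plus a random sample of safety_data, shuffled.

    The sample size is round(fraction * len(instruction_data)). Treating the
    percentage as relative to the base instruction set is an assumption.
    """
    rng = random.Random(seed)
    n_safety = round(fraction * len(instruction_data))
    if n_safety > len(safety_data):
        raise ValueError("not enough safety examples for the requested fraction")
    mixed = list(instruction_data) + rng.sample(list(safety_data), n_safety)
    rng.shuffle(mixed)
    return mixed


# Hypothetical placeholder records, not real training data.
instructions = [
    {"instruction": f"task {i}", "output": "...", "safety": False}
    for i in range(1000)
]
safety = [
    {"instruction": f"unsafe request {i}", "output": "refusal", "safety": True}
    for i in range(100)
]

mixed = mix_safety_examples(instructions, safety, fraction=0.03)
print(len(mixed), sum(r["safety"] for r in mixed))  # 1030 30
```

Raising `fraction` in a sketch like this is the knob that, according to the abstract, can tip a model toward refusing safe prompts that only superficially resemble unsafe ones.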

Federico Bianchi, Mirac Suzgun, Giuseppe Attanasio, Paul Röttger, Dan Jurafsky, Tatsunori Hashimoto, James Zou • 2023

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Question Answering | ARC Easy | Accuracy: 80 | 597 |
| Question Answering | PIQA | Accuracy: 82 | 374 |
| Instruction Following | AlpacaEval | Win Rate: 38.96 | 227 |
| Multiple-choice Question Answering | MMLU | Accuracy: 71 | 185 |
| Safety Evaluation | HEx-PHI | HEx-PHI Score: 64.63 | 162 |
| Knowledge | MMLU | Accuracy: 67.14 | 136 |
| Safety Evaluation | HarmBench | HarmBench Score: 2 | 112 |
| Reasoning | GSM8K | -- | 106 |
| Question Answering | ARC Challenge | Normalized Accuracy: 56 | 86 |
| Knowledge | GPQA | Accuracy: 23.48 | 51 |
Showing 10 of 49 rows
