
Refusal in Language Models Is Mediated by a Single Direction

About

Conversational large language models are fine-tuned for both instruction-following and safety, resulting in models that obey benign requests but refuse harmful ones. While this refusal behavior is widespread across chat models, its underlying mechanisms remain poorly understood. In this work, we show that refusal is mediated by a one-dimensional subspace, across 13 popular open-source chat models up to 72B parameters in size. Specifically, for each model, we find a single direction such that erasing this direction from the model's residual stream activations prevents it from refusing harmful instructions, while adding this direction elicits refusal on even harmless instructions. Leveraging this insight, we propose a novel white-box jailbreak method that surgically disables refusal with minimal effect on other capabilities. Finally, we mechanistically analyze how adversarial suffixes suppress propagation of the refusal-mediating direction. Our findings underscore the brittleness of current safety fine-tuning methods. More broadly, our work showcases how an understanding of model internals can be leveraged to develop practical methods for controlling model behavior.
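The core intervention described above — erasing a single direction from the residual stream, or adding it back — can be sketched with simple linear algebra. This is an illustrative NumPy sketch, not the authors' implementation: the function names and the `alpha` scaling parameter are assumptions, and in practice the direction is extracted from a real model's activations (e.g., as a difference in mean activations between harmful and harmless prompts) and the edit is applied inside the forward pass via hooks.

```python
import numpy as np

def ablate_direction(activations, direction):
    """Remove the component of each activation along a single direction.

    activations: (n_tokens, d_model) residual-stream vectors.
    direction:   (d_model,) hypothesized refusal direction (any scale).
    Returns activations with zero component along the direction:
        x' = x - (x . r_hat) * r_hat
    """
    r_hat = direction / np.linalg.norm(direction)
    return activations - np.outer(activations @ r_hat, r_hat)

def add_direction(activations, direction, alpha=1.0):
    """Add the (unit-normalized) direction to every activation,
    scaled by alpha, to elicit the mediated behavior (refusal)."""
    r_hat = direction / np.linalg.norm(direction)
    return activations + alpha * r_hat
```

After `ablate_direction`, every activation is orthogonal to the refusal direction, so no downstream component can read it out along that axis; `add_direction` does the converse. In a real model these edits would be applied at one or more layers' residual streams during generation.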

Andy Arditi, Oscar Obeso, Aaquib Syed, Daniel Paleka, Nina Panickssery, Wes Gurnee, Neel Nanda • 2024

Related benchmarks

Task                               Dataset        Metric                      Result   Rank
Commonsense Reasoning              HellaSwag      Accuracy                    82.5     1460
Multi-task Language Understanding  MMLU           Accuracy                    78.3     842
Commonsense Reasoning              WinoGrande     Accuracy                    78.6     776
Jailbreak Attack                   HarmBench      Attack Success Rate (ASR)   98       376
Instruction Following              IFEval         --                          --       292
Math Reasoning                     GSM8K          Accuracy                    92.7     126
Truthfulness Evaluation            TruthfulQA     Accuracy                    65.3     93
Jailbreak Attack                   StrongREJECT   Attack Success Rate (ASR)   78.9     88
Safety Alignment                   HarmBench      ASR                         11.54    88
Adversarial Attack Success Rate    AdvBench       ASR                         82.12    75

Showing 10 of 23 rows.

Other info

Code
