
PrisonBreak: Jailbreaking Large Language Models with at Most Twenty-Five Targeted Bit-flips

About

We study a new vulnerability in commercial-scale safety-aligned large language models (LLMs): their refusal to generate harmful responses can be broken by flipping only a few bits in model parameters. Our attack jailbreaks billion-parameter language models with just 5 to 25 bit-flips, requiring up to 40× fewer bit-flips than prior attacks on much smaller computer vision models. Unlike prompt-based jailbreaks, our method directly uncensors models in memory at runtime, enabling harmful outputs without requiring input-level modifications. Our key innovation is an efficient bit-selection algorithm that identifies critical bits for language model jailbreaks up to 20× faster than prior methods. We evaluate our attack on 10 open-source LLMs, achieving high attack success rates (ASRs) of 80-98% with minimal impact on model utility. We further demonstrate an end-to-end exploit via Rowhammer-based fault injection, reliably jailbreaking 5 models (69-91% ASR) on a GDDR6 GPU. Our analyses reveal that: (1) models with weaker post-training alignment require fewer bit-flips to jailbreak; (2) certain model components, e.g., value projection layers, are substantially more vulnerable; and (3) the attack is mechanistically different from existing jailbreak methods. We evaluate potential countermeasures and find that our attack remains effective against defenses at various stages of the LLM pipeline.
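The core primitive the abstract describes is that flipping a single high-order bit in a weight's floating-point encoding can change its value by many orders of magnitude. A minimal sketch of that primitive (not the paper's bit-selection algorithm; float32 encoding and the specific bit index are illustrative assumptions):

```python
import struct

def flip_bit(value: float, bit: int) -> float:
    """Flip one bit in the IEEE-754 float32 encoding of `value`."""
    (bits,) = struct.unpack("<I", struct.pack("<f", value))
    bits ^= 1 << bit
    (flipped,) = struct.unpack("<f", struct.pack("<I", bits))
    return flipped

w = 0.0123  # a typical small model weight (illustrative)
# Bit 30 is the most significant exponent bit in float32:
# flipping it scales the weight by roughly 2^128.
corrupted = flip_bit(w, 30)
print(w, "->", corrupted)
```

Flipping the same bit again restores the original (float32-rounded) value, which is why such attacks are hard to detect from the weights file alone: the corruption exists only in memory at runtime.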

Zachary Coalson, Jeonghyun Woo, Chris S. Lin, Joyce Qu, Yu Sun, Shiyang Chen, Lishan Yang, Gururaj Saileshwar, Prashant Nair, Bo Fang, Sanghyun Hong • 2024

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Factual Question Answering | TriviaQA (test) | Accuracy: 73 | 29 |
| Mathematical Reasoning | GSM8K (test) | Accuracy: 80 | 29 |
| Reading Comprehension | DROP (test) | F1 Score: 66 | 29 |
| Inference Cost Attack | Alpaca Samantha-7B (test) | Average Length: 1.75e+3 | 6 |
| Inference Cost Attack | Alpaca Llama2-7B (test) | Average Length: 712 | 6 |
| Inference Cost Attack | Alpaca Vicuna-7B (test) | Average Length: 3 | 6 |
| Fault Assessment Efficiency | MMLU and MMLU-Pro on LLM Workloads (GPT-2 Large, LLaMA 3.1 8B, DeepSeek-V2 7B) | Coverage: 73.8 | 5 |
