Fast Adversarial Attacks on Language Models In One GPU Minute

About

In this paper, we introduce a novel class of fast, beam search-based adversarial attacks (BEAST) for Language Models (LMs). BEAST employs interpretable parameters, enabling attackers to balance attack speed, success rate, and the readability of adversarial prompts. The computational efficiency of BEAST allows us to investigate its applications to LMs for jailbreaking, eliciting hallucinations, and privacy attacks. Our gradient-free targeted attack can jailbreak aligned LMs with high attack success rates within one minute. For instance, using a single Nvidia RTX A6000 48GB GPU, BEAST can jailbreak Vicuna-7B-v1.5 in under one minute with a success rate of 89%, whereas a gradient-based baseline takes over an hour to achieve a 70% success rate. Additionally, we discover a unique outcome wherein our untargeted attack induces hallucinations in LM chatbots. Through human evaluations, we find that our untargeted attack causes Vicuna-7B-v1.5 to produce ~15% more incorrect outputs compared to its outputs in the absence of our attack. We also find that 22% of the time, BEAST causes Vicuna to generate outputs that are not relevant to the original prompt. Further, we use BEAST to generate adversarial prompts in a few seconds that can boost the performance of existing membership inference attacks for LMs. We believe that our fast attack, BEAST, has the potential to accelerate research in LM security and privacy. Our codebase is publicly available at https://github.com/vinusankars/BEAST.
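To illustrate how a beam search with two tunable sizes can trade speed against search quality, here is a minimal, generic sketch on a toy problem. It is not the authors' implementation and involves no language model: the names `beam_search`, `toy_propose`, `toy_score`, `beam_width`, and `candidates_per_beam` are all hypothetical, the proposal function stands in for drawing candidate tokens from a model, and the scoring function stands in for whatever objective is being optimized.

```python
from typing import Callable, List, Tuple

Sequence = Tuple[str, ...]

def beam_search(
    propose: Callable[[Sequence, int], List[str]],
    score: Callable[[Sequence], float],
    length: int,
    beam_width: int,
    candidates_per_beam: int,
) -> Sequence:
    """Generic beam search: grow sequences one token at a time,
    keeping only the `beam_width` highest-scoring ones per step."""
    beams: List[Sequence] = [()]
    for _ in range(length):
        # Expand every kept sequence with its proposed next tokens.
        expanded = [
            beam + (tok,)
            for beam in beams
            for tok in propose(beam, candidates_per_beam)
        ]
        # Keep the best `beam_width` sequences (higher score is better).
        expanded.sort(key=score, reverse=True)
        beams = expanded[:beam_width]
    return beams[0]

# Toy stand-ins (assumptions for illustration only).
VOCAB = ["a", "b", "c"]

def toy_propose(beam: Sequence, k: int) -> List[str]:
    # Placeholder for sampling k candidate tokens; here, the first k of VOCAB.
    return VOCAB[:k]

def toy_score(seq: Sequence) -> float:
    # Placeholder objective: reward sequences containing many "b" tokens.
    return float(seq.count("b"))

best = beam_search(toy_propose, toy_score, length=4,
                   beam_width=2, candidates_per_beam=3)
print(best)
```

In this toy setup, larger `beam_width` and `candidates_per_beam` values explore more sequences per step at higher cost, while smaller values run faster but may miss better sequences (with `candidates_per_beam=1` the search never sees the token "b" at all).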

Vinu Sankar Sadasivan, Shoumik Saha, Gaurang Sriramanan, Priyatham Kattakinda, Atoosa Chegini, Soheil Feizi • 2024

Related benchmarks

Task	Dataset	Result
Adversarial Attack	AdvBench (test)	ASR: 83.85	145
Adversarial jailbreak attack	Guanaco 7B	Attack Success Rate (ASR): 100	58
Adversarial jailbreak attack	Vicuna 7B	Attack Success Rate (ASR): 93.65	58
Token-forcing loss optimization	Random targets Held-out (val)	Qwen-2.5-7B Loss: 12.74	56
Adversarial jailbreak attack	Vicuna 13B	Attack Success Rate (ASR): 84.8	55
Adversarial Attack	Mistral-7B	ASR: 57.12	45
LLM Jailbreaking	AdvBench	ASR-M: 58.08	16
LLM Jailbreaking	HarmBench text (test N = 320)	ASR-M: 69.38	16
Adversarial jailbreak attack	Mistral-7B	Attack Success Rate (ASR): 57.12	13
Feature Visualization	Gemma Scope Gemma 2 2B 16k-wide residual stream 486-latents SAE (train/test)	Win Rate: 50.2	12

