
AdvPrefix: An Objective for Nuanced LLM Jailbreaks

About

Many jailbreak attacks on large language models (LLMs) rely on a common objective: making the model respond with the prefix "Sure, here is (harmful request)". While straightforward, this objective has two limitations: limited control over model behaviors, yielding incomplete or unrealistic jailbroken responses, and a rigid format that hinders optimization. We introduce AdvPrefix, a plug-and-play prefix-forcing objective that selects one or more model-dependent prefixes by combining two criteria: high prefilling attack success rates and low negative log-likelihood. AdvPrefix integrates seamlessly into existing jailbreak attacks to mitigate the previous limitations for free. For example, replacing GCG's default prefixes on Llama-3 improves nuanced attack success rates from 14% to 80%, revealing that current safety alignment fails to generalize to new prefixes. Code and selected prefixes are released at github.com/facebookresearch/jailbreak-objectives.
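The selection criterion described above can be sketched in a few lines. This is a hypothetical illustration, not the released implementation: the function names, thresholds, and numbers below are assumptions made up for the example. It assumes each candidate prefix has already been scored with a measured prefilling attack success rate (ASR) and a negative log-likelihood (NLL) under the target model, and it keeps prefixes with high ASR while preferring those with low NLL.

```python
# Hypothetical sketch of a prefix-selection criterion in the spirit of
# AdvPrefix (NOT the paper's actual code): filter candidate prefixes by
# prefilling attack success rate, then rank by negative log-likelihood
# so the easiest-to-force prefixes come first.

def select_prefixes(candidates, asr_threshold=0.5, top_k=1):
    """candidates: list of dicts with keys 'prefix', 'prefill_asr', 'nll'.

    Returns up to top_k prefix strings whose prefilling ASR meets the
    threshold, ordered by ascending NLL (lowest-NLL prefix first).
    """
    viable = [c for c in candidates if c["prefill_asr"] >= asr_threshold]
    viable.sort(key=lambda c: c["nll"])
    return [c["prefix"] for c in viable[:top_k]]


# Toy candidate pool with made-up scores for illustration only.
candidates = [
    {"prefix": "Sure, here is", "prefill_asr": 0.14, "nll": 2.1},
    {"prefix": "Step 1:", "prefill_asr": 0.80, "nll": 1.3},
    {"prefix": "Of course! Here is the plan:", "prefill_asr": 0.75, "nll": 3.9},
]
print(select_prefixes(candidates, asr_threshold=0.5, top_k=2))
# → ['Step 1:', 'Of course! Here is the plan:']
```

In practice the two criteria would be computed against the target model (prefilling the candidate prefix and checking response completion for ASR; scoring the prefix's tokens for NLL), which is why the selected prefixes are model-dependent.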

Sicheng Zhu, Brandon Amos, Yuandong Tian, Chuan Guo, Ivan Evtimov • 2024

Related benchmarks

Task                              Dataset        Result        Rank
Adversarial Attack Success Rate   AdvBench       ASR 60.58     75
Red-teaming Safety Evaluation     StrongREJECT   ASR 65.18     32
