
SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks

About

Despite efforts to align large language models (LLMs) with human intentions, widely-used LLMs such as GPT, Llama, and Claude are susceptible to jailbreaking attacks, wherein an adversary fools a targeted LLM into generating objectionable content. To address this vulnerability, we propose SmoothLLM, the first algorithm designed to mitigate jailbreaking attacks. Based on our finding that adversarially-generated prompts are brittle to character-level changes, our defense randomly perturbs multiple copies of a given input prompt, and then aggregates the corresponding predictions to detect adversarial inputs. Across a range of popular LLMs, SmoothLLM sets the state-of-the-art for robustness against the GCG, PAIR, RandomSearch, and AmpleGCG jailbreaks. SmoothLLM is also resistant against adaptive GCG attacks, exhibits a small, though non-negligible trade-off between robustness and nominal performance, and is compatible with any LLM. Our code is publicly available at https://github.com/arobey1/smooth-llm.
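The defense described above can be sketched in a few lines: perturb several copies of the prompt at the character level, query the model on each copy, and take a majority vote over which responses look jailbroken. This is a minimal illustration, not the authors' implementation; the function names, the keyword-based jailbreak check, and the toy parameters (`n_copies`, `q`) are assumptions for the sketch, and `llm` stands in for any callable that maps a prompt to a response.

```python
import random


def perturb(prompt: str, q: float, rng: random.Random) -> str:
    """Randomly replace a fraction q of characters (one possible
    character-level perturbation; the paper also considers others)."""
    chars = list(prompt)
    n_swap = max(1, int(q * len(chars)))
    for i in rng.sample(range(len(chars)), n_swap):
        chars[i] = rng.choice("abcdefghijklmnopqrstuvwxyz ")
    return "".join(chars)


def is_jailbroken(response: str) -> bool:
    """Heuristic flag: treat a refusal as evidence the attack failed.
    (A hypothetical keyword check; real judges are more sophisticated.)"""
    refusals = ("I'm sorry", "I cannot", "I can't")
    return not any(r in response for r in refusals)


def smooth_llm(prompt: str, llm, n_copies: int = 10, q: float = 0.1,
               seed: int = 0) -> str:
    """Query the LLM on n_copies perturbed prompts and return a response
    consistent with the majority vote on jailbroken-ness."""
    rng = random.Random(seed)
    responses = [llm(perturb(prompt, q, rng)) for _ in range(n_copies)]
    flags = [is_jailbroken(r) for r in responses]
    majority_jailbroken = sum(flags) > len(flags) / 2
    for r, f in zip(responses, flags):
        if f == majority_jailbroken:
            return r
    return responses[0]
```

The intuition the paper leans on is that adversarial suffixes are brittle: perturbing even a small fraction of characters usually breaks the attack, so most of the perturbed copies elicit refusals and the aggregated answer is safe.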

Alexander Robey, Eric Wong, Hamed Hassani, George J. Pappas• 2023

Related benchmarks

Task | Dataset | Metric | Result | Rank
Jailbreak Defense | HarmBench and AdvBench (test) | GCG Score | 18.4 | 44
Jailbreak Defense | AdvBench PAIR attack | DSR | 98 | 35
Jailbreak Defense Performance | Jailbreak Attack Dataset | DSR | 54.22 | 33
Jailbreak Defense | JailbreakBench and AdvBench | ASR | 3.5 | 21
Response Quality Evaluation | MT-Bench | Average Response Quality | 7.35 | 19
Jailbreak attack success rate | AdvBench-x | ASR (English) | 7.94 | 18
Jailbreak attack success rate | MultiJail | ASR (EN) | 7.94 | 18
Red-Teaming (Attack Success Rate) | JailbreakBench (test) | ASR (Vicuna) | 64 | 18
Jailbreak Defense | AdvBench GCG attack | DSR | 100 | 15
Jailbreak Defense | AdvBench AutoDAN attack | DSR | 100 | 15
Showing 10 of 23 rows
