SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks
About
Despite efforts to align large language models (LLMs) with human intentions, widely used LLMs such as GPT, Llama, and Claude are susceptible to jailbreaking attacks, wherein an adversary fools a targeted LLM into generating objectionable content. To address this vulnerability, we propose SmoothLLM, the first algorithm designed to mitigate jailbreaking attacks. Based on our finding that adversarially generated prompts are brittle to character-level changes, our defense randomly perturbs multiple copies of a given input prompt and then aggregates the corresponding predictions to detect adversarial inputs. Across a range of popular LLMs, SmoothLLM sets the state of the art for robustness against the GCG, PAIR, RandomSearch, and AmpleGCG jailbreaks. SmoothLLM is also resistant to adaptive GCG attacks, exhibits a small, though non-negligible, trade-off between robustness and nominal performance, and is compatible with any LLM. Our code is publicly available at https://github.com/arobey1/smooth-llm.
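The perturb-then-aggregate idea described in the abstract can be sketched in a few lines of Python. This is a minimal illustration and not the authors' implementation: the function names (`perturb`, `smooth_generate`), the parameters (`n_copies`, `q`), the random character-replacement perturbation, and the majority-vote aggregation are all assumptions made for the example, and `toy_llm` and `toy_judge` are hypothetical stand-ins for a real model and a real jailbreak classifier.

```python
import random
import string


def perturb(prompt: str, q: float, rng: random.Random) -> str:
    """Replace a fraction q of the characters with random printable ones."""
    chars = list(prompt)
    k = int(len(chars) * q)  # number of positions to overwrite
    for i in rng.sample(range(len(chars)), k):
        chars[i] = rng.choice(string.printable)
    return "".join(chars)


def smooth_generate(prompt, llm, is_jailbroken, n_copies=10, q=0.1, seed=0):
    """Query the model on perturbed copies and aggregate by majority vote.

    Returns (response, flagged), where `flagged` is True when more than
    half of the perturbed copies produced a jailbroken response.
    """
    rng = random.Random(seed)
    responses = [llm(perturb(prompt, q, rng)) for _ in range(n_copies)]
    flags = [is_jailbroken(r) for r in responses]
    majority = sum(flags) > len(flags) / 2
    # Return one response that agrees with the majority outcome.
    consistent = [r for r, f in zip(responses, flags) if f == majority]
    return rng.choice(consistent), majority


# Hypothetical stand-ins: a "model" that is jailbroken only when an exact
# adversarial suffix survives, and a judge that checks the reply's opening.
def toy_llm(prompt: str) -> str:
    if "!!ADV!!" in prompt:
        return "Sure, here is how"
    return "I'm sorry, I can't help."


def toy_judge(response: str) -> bool:
    return response.startswith("Sure")
```

In this toy setup, the attack succeeds only if the suffix `!!ADV!!` survives character for character, which mirrors the brittleness the abstract describes: with `q=0.0` the suffix always survives, while larger `q` makes it increasingly likely that most perturbed copies break it and the majority vote comes out benign.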
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Jailbreak Defense | HarmBench and AdvBench (test) | GCG Score | 18.4 | 44 |
| Jailbreak Defense | AdvBench PAIR attack | DSR | 98 | 35 |
| Jailbreak Defense Performance | Jailbreak Attack Dataset | DSR | 54.22 | 33 |
| Jailbreak Defense | JailbreakBench and AdvBench | ASR | 3.5 | 21 |
| Response Quality Evaluation | MT-Bench | Average Response Quality | 7.35 | 19 |
| Jailbreak attack success rate | AdvBench-x | ASR (English) | 7.94 | 18 |
| Jailbreak attack success rate | MultiJail | ASR (EN) | 7.94 | 18 |
| Red-Teaming (Attack Success Rate) | JailbreakBench (test) | ASR (Vicuna) | 64 | 18 |
| Jailbreak Defense | AdvBench GCG attack | DSR | 100 | 15 |
| Jailbreak Defense | AdvBench AutoDAN attack | DSR | 100 | 15 |