Tempest: Autonomous Multi-Turn Jailbreaking of Large Language Models with Tree Search
About
We introduce Tempest, a multi-turn adversarial framework that models the gradual erosion of Large Language Model (LLM) safety through a tree search perspective. Unlike single-turn jailbreaks that rely on one meticulously engineered prompt, Tempest expands the conversation at each turn in a breadth-first fashion, branching out multiple adversarial prompts that exploit partial compliance from previous responses. By tracking these incremental policy leaks and re-injecting them into subsequent queries, Tempest reveals how minor concessions can accumulate into fully disallowed outputs. Evaluations on the JailbreakBench dataset show that Tempest achieves a 100% success rate on GPT-3.5-turbo and 97% on GPT-4 in a single multi-turn run, using fewer queries than baselines such as Crescendo or GOAT. This tree search methodology offers an in-depth view of how model safeguards degrade over successive dialogue turns, underscoring the urgency of robust multi-turn testing procedures for language models.
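The breadth-first expansion described above can be illustrated with a generic search skeleton. This is a minimal sketch, not the authors' implementation: the abstract does not specify how branches are scored or pruned, so the names `propose`, `respond`, `evaluate`, and the parameters `branching`, `keep`, `max_turns`, and `success` are all hypothetical. The callbacks here are harmless toy stand-ins that only manipulate short strings.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple

Turn = Tuple[str, str]  # (prompt, response)


@dataclass
class Node:
    """One conversation branch: its turn history and the latest score."""
    history: List[Turn] = field(default_factory=list)
    score: float = 0.0


def tree_search(
    propose: Callable[[List[Turn], int], List[str]],
    respond: Callable[[List[Turn], str], str],
    evaluate: Callable[[str], float],
    branching: int = 3,   # assumed: prompts generated per branch per turn
    keep: int = 2,        # assumed: branches kept after each turn
    max_turns: int = 5,
    success: float = 1.0,
) -> Tuple[Optional[Node], int]:
    """Expand every kept branch at each turn; return (first success, query count)."""
    frontier = [Node()]
    queries = 0
    for _ in range(max_turns):
        children: List[Node] = []
        for node in frontier:
            for prompt in propose(node.history, branching):
                reply = respond(node.history, prompt)
                queries += 1
                child = Node(node.history + [(prompt, reply)], evaluate(reply))
                if child.score >= success:
                    return child, queries
                children.append(child)
        if not children:
            break
        # Keep the highest-scoring branches (stable sort preserves ties' order).
        children.sort(key=lambda n: n.score, reverse=True)
        frontier = children[:keep]
    return None, queries


# Toy stand-ins (hypothetical): the "target" echoes accumulated prompts,
# and the "evaluator" counts how many times the letter "c" appears.
def toy_propose(history: List[Turn], k: int) -> List[str]:
    return ["a", "b", "c"][:k]


def toy_respond(history: List[Turn], prompt: str) -> str:
    previous = history[-1][1] if history else ""
    return previous + prompt


def toy_evaluate(reply: str) -> float:
    return reply.count("c") / 3


best, used = tree_search(toy_propose, toy_respond, toy_evaluate)
```

The design point the sketch tries to capture is that each child inherits its parent's full history, so a partially successful branch carries its progress into the next turn, while low-scoring branches are dropped to limit the number of queries.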
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Jailbreaking | AdvBench (test) | ASR (GPT-4o) | 85.3 | 27 |
| Jailbreaking | HarmBench (test) | ASR (GPT-4o) | 83.1 | 27 |
| Jailbreaking | JBB-Behaviors (test) | ASR (GPT-4o) | 86.4 | 27 |
| Jailbreaking | StrongReject (test) | ASR (GPT-4o) | 79.2 | 27 |
| Jailbreak attack success rate | AdvBench LLaMA-2-7B-Chat | ASR (SMO, GPT-4o) | 26 | 5 |
| Jailbreak attack success rate | AdvBench Phi-3 Medium 14B Instruct | ASR (SMO, GPT-4o) | 25 | 5 |
| Jailbreak attack success rate | AdvBench LLaMA-3.1-70B | ASR (SMO, GPT-4o) | 24 | 5 |
| Multi-turn Jailbreak Evaluation | MHJ Phi-3-Medium-14B (test) | ASR | 76.2 | 5 |
| Multi-turn Jailbreak Evaluation | MHJ LLaMA-3.1-70B (test) | ASR (%) | 81.1 | 5 |