Guiding not Forcing: Enhancing the Transferability of Jailbreaking Attacks on LLMs via Removing Superfluous Constraints
About
Jailbreaking attacks can effectively induce unsafe behaviors in Large Language Models (LLMs); however, the transferability of these attacks across different models remains limited. This study aims to understand and enhance the transferability of gradient-based jailbreaking methods, which are among the standard approaches for attacking white-box models. Through a detailed analysis of the optimization process, we introduce a novel conceptual framework to elucidate transferability and identify two superfluous constraints, the response pattern constraint and the token tail constraint, as significant barriers to improved transferability. Removing these unnecessary constraints substantially enhances the transferability and controllability of gradient-based attacks. Evaluated on Llama-3-8B-Instruct as the source model, our method increases the overall Transfer Attack Success Rate (T-ASR) across a set of target models with varying safety levels from 18.4% to 50.3%, while also improving the stability and controllability of jailbreak behaviors on both source and target models.
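To make the headline metric concrete, here is a minimal sketch of how an overall transfer success rate could be computed from per-target outcomes. The abstract does not state how T-ASR is aggregated, so pooling all attempts across target models is an assumption, and the function name `transfer_asr`, the model labels, and the toy outcomes are hypothetical. Only the 18.4% and 50.3% figures come from the abstract.

```python
from typing import Dict, List


def transfer_asr(outcomes: Dict[str, List[bool]]) -> float:
    """Return the percentage of successful attempts pooled over all target models.

    Assumption: every (prompt, target model) attempt is weighted equally.
    The paper may aggregate differently (for example, a per-model average).
    """
    total = sum(len(results) for results in outcomes.values())
    if total == 0:
        raise ValueError("no attempts recorded")
    successes = sum(sum(results) for results in outcomes.values())
    return 100.0 * successes / total


# Hypothetical toy outcomes: True means the judged response counted as a success.
toy_outcomes = {
    "target_a": [True, False, False, True],
    "target_b": [False, False, True, False],
}
toy_rate = transfer_asr(toy_outcomes)  # 3 successes out of 8 attempts

# Figures reported in the abstract (percent).
reported_before = 18.4
reported_after = 50.3
absolute_gain = round(reported_after - reported_before, 1)  # percentage points
relative_gain = round(reported_after / reported_before, 2)  # multiplicative factor

print(f"toy T-ASR: {toy_rate:.1f}%")
print(f"reported gain: {absolute_gain} points ({relative_gain}x)")
```

Under these assumptions, the reported change corresponds to a gain of 31.9 percentage points, or roughly a 2.7-fold increase over the baseline.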
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Jailbreak Attack | Gemma-7b five finetuned variants | Average ASR | 44.2 | 16 |
| Jailbreak Attack Transferability | Llama-2-7b-chat finetuned variants v1 (test) | Transfer Success Rate (TSR) | 28.8 | 16 |
| Jailbreak Attack Transferability | Gemma-7b-it finetuned variants v1 (test) | TSR | 39.4 | 16 |
| Jailbreak Attack Transferability | Llama-3-8b-Instruct finetuned variants v1 (test) | TSR | 27.4 | 16 |
| Jailbreak Attack Transferability | DeepSeek-llm-7b-chat finetuned variants v1 (test) | TSR | 47.4 | 16 |
| Jailbreak Attack | DeepSeek-7b five finetuned variants | Average ASR | 53.8 | 16 |
| Jailbreak Attack | LLaMA3-8B | Average ASR | 29.8 | 16 |
| Jailbreak Attack | Llama2-7b five finetuned variants | Average ASR | 33.2 | 16 |
| Jailbreak Attack | deepseek-7b v1 (pretrained) | ASR (%) | 81 | 13 |
| Jailbreak Attack | llama3-8b pretrained v1 | ASR | 61 | 13 |