Understanding and Enhancing the Transferability of Jailbreaking Attacks

About

Jailbreaking attacks can effectively manipulate open-source large language models (LLMs) to produce harmful responses. However, these attacks exhibit limited transferability, failing to disrupt proprietary LLMs consistently. To reliably identify vulnerabilities in proprietary LLMs, this work investigates the transferability of jailbreaking attacks by analysing their impact on the model's intent perception. By incorporating adversarial sequences, these attacks can redirect the source LLM's focus away from malicious-intent tokens in the original input, thereby obstructing the model's intent recognition and eliciting harmful responses. Nevertheless, these adversarial sequences fail to mislead the target LLM's intent perception, allowing the target LLM to refocus on malicious-intent tokens and abstain from responding. Our analysis further reveals the inherent distributional dependency within the generated adversarial sequences, whose effectiveness stems from overfitting the source LLM's parameters, resulting in limited transferability to target LLMs. To this end, we propose the Perceived-importance Flatten (PiF) method, which uniformly disperses the model's focus across neutral-intent tokens in the original input, thus obscuring malicious-intent tokens without relying on overfitted adversarial sequences. Extensive experiments demonstrate that PiF provides an effective and efficient red-teaming evaluation for proprietary LLMs.

Runqi Lin, Bo Han, Fengwang Li, Tongliang Liu • 2025
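
The abstract describes PiF as spreading the model's focus uniformly across tokens so that no single token dominates. The toy sketch below only illustrates what "flatter" means for an importance distribution; it is not the authors' implementation. The scores are made-up numbers, not measurements from any model, and the helper names (`normalize`, `entropy`, `flatness`) are hypothetical. It assumes per-token importance can be expressed as non-negative scores and uses normalized Shannon entropy as one possible flatness measure.

```python
import math


def normalize(scores):
    """Turn non-negative importance scores into a distribution summing to 1."""
    total = sum(scores)
    if total <= 0:
        raise ValueError("scores must have a positive sum")
    return [s / total for s in scores]


def entropy(dist):
    """Shannon entropy in nats; higher means focus is spread more evenly."""
    return -sum(p * math.log(p) for p in dist if p > 0)


def flatness(scores):
    """Entropy divided by its uniform maximum, giving a value in [0, 1]."""
    dist = normalize(scores)
    if len(dist) < 2:
        return 1.0
    return entropy(dist) / math.log(len(dist))


# Hypothetical importance scores over five placeholder tokens (assumption:
# the last token is the one the model focuses on most).
peaked = [0.05, 0.05, 0.10, 0.05, 0.75]
# The same tokens after an idealized flattening: every token weighs the same.
flattened = [0.2, 0.2, 0.2, 0.2, 0.2]

print(f"peaked flatness:    {flatness(peaked):.3f}")
print(f"flattened flatness: {flatness(flattened):.3f}")
```

A flatness near 1 means no token stands out, while a low value means the focus is concentrated on a few tokens. How importance is actually measured and how the input is modified to flatten it are defined in the paper, not here.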

Related benchmarks

Task | Dataset | Metric | Result | Rank
Jailbreak Attack | Llama2-7b five finetuned variants | Average ASR | 0.00 | 16
Jailbreak Attack | LLaMA3-8B | Average ASR | 0.00 | 16
Jailbreak Attack | DeepSeek-7b five finetuned variants | Average ASR | 27.4 | 16
Jailbreak Attack Transferability | DeepSeek-llm-7b-chat finetuned variants v1 (test) | TSR | 24.2 | 16
Jailbreak Attack | Gemma-7b five finetuned variants | Average ASR | 12 | 16
Jailbreak Attack Transferability | Gemma-7b-it finetuned variants v1 (test) | TSR | 11.6 | 16
Jailbreak Attack Transferability | Llama-2-7b-chat finetuned variants v1 (test) | Transfer Success Rate (TSR) | 0.00 | 16
Jailbreak Attack Transferability | Llama-3-8b-Instruct finetuned variants v1 (test) | TSR | 0.00 | 16
Jailbreak Attack | llama2-7b v1 (pretrained) | ASR | 0.00 | 13
Jailbreak Attack | llama3-8b pretrained v1 | ASR | 0.00 | 13
Showing 10 of 12 rows
