
Simulated Ensemble Attack: Transferring Jailbreaks Across Fine-tuned Vision-Language Models

About

The widespread practice of fine-tuning open-source Vision-Language Models (VLMs) raises a critical security concern: jailbreak vulnerabilities in base models may persist in downstream variants, enabling transferable attacks across fine-tuned systems. To investigate this risk, we propose the Simulated Ensemble Attack (SEA), a grey-box jailbreak framework that assumes full access to the base VLM but no knowledge of the fine-tuned target. SEA enhances transferability via Fine-tuning Trajectory Simulation (FTS), which models bounded parameter variations in the vision encoder, and Targeted Prompt Guidance (TPG), which stabilizes adversarial optimization through auxiliary textual guidance. Experiments on the Qwen2-VL family demonstrate that SEA achieves consistently high transfer success and toxicity rates across diverse fine-tuned variants, including safety-enhanced models, while standard projected gradient descent (PGD) image jailbreaks exhibit negligible transferability. Further analysis reveals that fine-tuning primarily induces localized parameter shifts around the base model, explaining why attacks optimized over a simulated neighborhood transfer effectively. We also show that SEA generalizes across different base generations (e.g., Qwen2.5-VL and Qwen3-VL), indicating that its effectiveness arises from shared fine-tuning-induced behaviors rather than architecture- or initialization-specific factors.

Ruofan Wang, Xin Wang, Yang Yao, Juncheng Li, Xuan Tong, Xingjun Ma · 2025
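The abstract describes FTS only at a high level. As a rough illustration of the general idea (optimizing a single input perturbation against an ensemble of randomly perturbed copies of a base encoder, so that it stays effective across a neighborhood of the base weights), here is a minimal pure-Python sketch. Everything in it is an assumption: the toy linear "encoder" `W_base`, the squared-distance loss toward a hypothetical `target` embedding, the uniform weight noise of radius `weight_eps`, the helper names `simulate_finetuned` and `ensemble_pgd`, and all hyperparameter values. It is not the authors' SEA implementation, it omits TPG entirely, and no real VLM is involved.

```python
import random


def matvec(W, x):
    """Return W @ x for a matrix stored as a list of rows."""
    return [sum(w * xj for w, xj in zip(row, x)) for row in W]


def matvec_t(W, r):
    """Return W^T @ r."""
    return [sum(W[i][j] * r[i] for i in range(len(W))) for j in range(len(W[0]))]


def loss_and_grad(W, x, target):
    """Toy targeted loss 0.5 * ||W x - target||^2 and its gradient with respect to x."""
    residual = [e - t for e, t in zip(matvec(W, x), target)]
    loss = 0.5 * sum(r * r for r in residual)
    return loss, matvec_t(W, residual)


def simulate_finetuned(W_base, weight_eps, num_models, rng):
    """Sample weight copies whose entries stay within +/- weight_eps of W_base.

    Hypothetical stand-in for "bounded parameter variations"; the paper's FTS may differ.
    """
    return [
        [[w + rng.uniform(-weight_eps, weight_eps) for w in row] for row in W_base]
        for _ in range(num_models)
    ]


def sign(v):
    """Return -1, 0 or 1 according to the sign of v."""
    return (v > 0) - (v < 0)


def ensemble_pgd(W_base, x0, target, eps=0.1, alpha=0.02, steps=50,
                 weight_eps=0.05, num_models=8, seed=0):
    """L_inf PGD on the input, averaging gradients over a simulated weight neighborhood."""
    rng = random.Random(seed)
    x = list(x0)
    for _ in range(steps):
        # Draw a fresh simulated ensemble at every step (one possible design choice).
        ensemble = simulate_finetuned(W_base, weight_eps, num_models, rng)
        avg_grad = [0.0] * len(x)
        for W in ensemble:
            _, g = loss_and_grad(W, x, target)
            avg_grad = [a + gi / num_models for a, gi in zip(avg_grad, g)]
        # Targeted attack: take a signed step that decreases the loss.
        x = [xi - alpha * sign(gi) for xi, gi in zip(x, avg_grad)]
        # Project back onto the eps-ball around x0 and onto the valid range [0, 1].
        x = [min(max(xi, x0i - eps), x0i + eps) for xi, x0i in zip(x, x0)]
        x = [min(max(xi, 0.0), 1.0) for xi in x]
    return x


# Toy usage: all numbers below are made up for illustration.
W_base = [[0.5, -0.2, 0.1, 0.0],
          [0.1, 0.4, -0.3, 0.2],
          [0.0, 0.3, 0.2, -0.1]]
x0 = [0.5, 0.5, 0.5, 0.5]
target = [0.4, 0.0, 0.3]
x_adv = ensemble_pgd(W_base, x0, target)

# A held-out "fine-tuned" variant that the attack never saw during optimization.
W_heldout = simulate_finetuned(W_base, 0.05, 1, random.Random(123))[0]
for name, W in [("base", W_base), ("held-out", W_heldout)]:
    before = loss_and_grad(W, x0, target)[0]
    after = loss_and_grad(W, x_adv, target)[0]
    print(f"{name}: loss {before:.4f} -> {after:.4f}")
```

Resampling the simulated ensemble at every step is only one reading of "ensemble"; optimizing against a fixed set of sampled variants would be an equally plausible sketch.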

Related benchmarks

Task | Dataset | Metric | Result | Rank
Jailbreak Attack | LLaMA3-8B | Average ASR | 10.8 | 16
Jailbreak Attack | DeepSeek-7b five finetuned variants | Average ASR | 45.8 | 16
Jailbreak Attack Transferability | Llama-2-7b-chat finetuned variants v1 (test) | Transfer Success Rate (TSR) | 20.4 | 16
Jailbreak Attack | Llama2-7b five finetuned variants | Average ASR | 20.4 | 16
Jailbreak Attack | Gemma-7b five finetuned variants | Average ASR | 25.6 | 16
Jailbreak Attack Transferability | DeepSeek-llm-7b-chat finetuned variants v1 (test) | TSR | 45.8 | 16
Jailbreak Attack Transferability | Gemma-7b-it finetuned variants v1 (test) | TSR | 25.6 | 16
Jailbreak Attack Transferability | Llama-3-8b-Instruct finetuned variants v1 (test) | TSR | 10.8 | 16
Jailbreak Attack | deepseek-7b v1 (pretrained) | ASR (%) | 60 | 13
Jailbreak Attack | llama3-8b pretrained v1 | ASR | 35 | 13

ASR = attack success rate; TSR = transfer success rate. 10 of 12 rows shown.
