Simulated Ensemble Attack: Transferring Jailbreaks Across Fine-tuned Vision-Language Models

About

The widespread practice of fine-tuning open-source Vision-Language Models (VLMs) raises a critical security concern: jailbreak vulnerabilities in base models may persist in downstream variants, enabling transferable attacks across fine-tuned systems. To investigate this risk, we propose the Simulated Ensemble Attack (SEA), a grey-box jailbreak framework that assumes full access to the base VLM but no knowledge of the fine-tuned target. SEA enhances transferability via Fine-tuning Trajectory Simulation (FTS), which models bounded parameter variations in the vision encoder, and Targeted Prompt Guidance (TPG), which stabilizes adversarial optimization through auxiliary textual guidance. Experiments on the Qwen2-VL family demonstrate that SEA achieves consistently high transfer success and toxicity rates across diverse fine-tuned variants, including safety-enhanced models, while standard PGD-based image jailbreaks exhibit negligible transferability. Further analysis reveals that fine-tuning primarily induces localized parameter shifts around the base model, explaining why attacks optimized over a simulated neighborhood transfer effectively. We also show that SEA generalizes across different base generations (e.g., Qwen2.5/3-VL), indicating that its effectiveness arises from shared fine-tuning-induced behaviors rather than architecture- or initialization-specific factors.
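The idea of optimizing an attack over a simulated neighborhood of the base model can be illustrated with a toy sketch. This is not the authors' implementation: the linear "model", the uniform sampling of weight perturbations, the squared-error loss, and all names (`simulate_variants`, `ensemble_pgd`, `radius`, `eps`) are hypothetical stand-ins, and Targeted Prompt Guidance is not modeled. It only shows signed-gradient PGD on an input where each step averages gradients over several bounded perturbations of the base weights.

```python
import random


def simulate_variants(base_w, radius, n, rng):
    """Sample n weight vectors inside an L-inf ball of `radius` around base_w.

    Toy stand-in for bounded parameter variation; uniform sampling is an assumption.
    """
    return [[w + rng.uniform(-radius, radius) for w in base_w] for _ in range(n)]


def loss_and_grad(w, x, target):
    """Squared error of the linear score w.x against target, plus d(loss)/dx."""
    err = sum(wi * xi for wi, xi in zip(w, x)) - target
    return err * err, [2.0 * err * wi for wi in w]


def sign(v):
    """Return -1, 0, or 1 according to the sign of v."""
    return (v > 0) - (v < 0)


def ensemble_pgd(base_w, x0, target, eps=0.5, step=0.05, iters=100,
                 radius=0.1, n_variants=8, seed=0):
    """Signed-gradient PGD on x, averaging gradients over simulated variants."""
    rng = random.Random(seed)
    x = list(x0)
    for _ in range(iters):
        # Resample a fresh set of perturbed weight vectors at every step.
        variants = simulate_variants(base_w, radius, n_variants, rng)
        grad = [0.0] * len(x)
        for w in variants:
            _, g = loss_and_grad(w, x, target)
            grad = [a + b / n_variants for a, b in zip(grad, g)]
        # Descend on the averaged loss, then project back into the eps-ball around x0.
        x = [xi - step * sign(gi) for xi, gi in zip(x, grad)]
        x = [min(max(xi, ci - eps), ci + eps) for xi, ci in zip(x, x0)]
    return x


base_w = [1.0, -2.0, 0.5]   # hypothetical "base model" weights
x0 = [0.0, 0.0, 0.0]        # clean input
adv = ensemble_pgd(base_w, x0, target=1.0)
print("loss before:", loss_and_grad(base_w, x0, 1.0)[0])
print("loss after: ", loss_and_grad(base_w, adv, 1.0)[0])
```

Because the perturbation is optimized against many nearby weight settings rather than the base weights alone, it is less tied to one exact parameter vector, which is the intuition the abstract gives for why such attacks transfer to fine-tuned variants.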

Ruofan Wang, Xin Wang, Yang Yao, Juncheng Li, Xuan Tong, Xingjun Ma • 2025

Related benchmarks

Task	Dataset	Result
Jailbreak Attack	SafeBench	ASR 26.51	245
Jailbreaking	HarmBench	Attack Success Rate (ASR) 78.5	68
Jailbreak Attack	JailBreakV_28K	Attack Success Rate (ASR) 85.23	57
Jailbreak Attack	LLaMA3-8B	Average ASR 10.8	16
Jailbreak Attack	DeepSeek-7b five finetuned variants	Average ASR 45.8	16
Jailbreak Attack Transferability	Llama-2-7b-chat finetuned variants v1 (test)	Transfer Success Rate (TSR) 20.4	16
Jailbreak Attack	Llama2-7b five finetuned variants	Average ASR 20.4	16
Jailbreak Attack	Gemma-7b five finetuned variants	Average ASR 25.6	16
Jailbreak Attack Transferability	DeepSeek-llm-7b-chat finetuned variants v1 (test)	TSR 45.8	16
Jailbreak Attack Transferability	Gemma-7b-it finetuned variants v1 (test)	TSR 25.6	16

Showing 10 of 15 rows
