One Leak Away: How Pretrained Model Exposure Amplifies Jailbreak Risks in Finetuned LLMs

About

Finetuning pretrained large language models (LLMs) has become the standard paradigm for developing downstream applications. However, its security implications remain unclear, particularly regarding whether finetuned LLMs inherit jailbreak vulnerabilities from their pretrained sources. We investigate this question in a realistic pretrain-to-finetune threat model, where the attacker has white-box access to the pretrained LLM and only black-box access to its finetuned derivatives. Empirical analysis shows that adversarial prompts optimized on the pretrained model transfer most effectively to its finetuned variants, revealing inherited vulnerabilities from pretrained to finetuned LLMs. To further examine this inheritance, we conduct representation-level probing, which shows that transferable prompts are linearly separable within the pretrained hidden states, suggesting that universal transferability is encoded in pretrained representations. Building on this insight, we propose the Probe-Guided Projection (PGP) attack, which steers optimization toward transferability-relevant directions. Experiments across multiple LLM families and diverse finetuned tasks confirm PGP's strong transfer success, underscoring the security risks inherent in the pretrain-to-finetune paradigm.
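The two representation-level ideas in the abstract, a linear probe over hidden states and a projection onto the direction that probe finds, can be illustrated with a toy sketch. This is not the authors' PGP implementation. It assumes synthetic Gaussian vectors in place of real hidden states, swaps in a simple mean-difference probe for whatever probe the paper trains, and uses hypothetical names throughout (`fit_mean_difference_probe`, `project_onto`, `make_synthetic_states`).

```python
import random


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def mean_vector(rows):
    """Coordinate-wise mean of a list of vectors."""
    dim = len(rows[0])
    return [sum(r[i] for r in rows) / len(rows) for i in range(dim)]


def fit_mean_difference_probe(pos, neg):
    """Fit a linear probe score(h) = w.h + b from the two class means.

    Assumption: a mean-difference probe stands in for the paper's probe.
    """
    mu_pos, mu_neg = mean_vector(pos), mean_vector(neg)
    w = [p - n for p, n in zip(mu_pos, mu_neg)]
    midpoint = [(p + n) / 2 for p, n in zip(mu_pos, mu_neg)]
    b = -dot(w, midpoint)  # decision boundary passes through the midpoint
    return w, b


def probe_accuracy(w, b, pos, neg):
    """Fraction of vectors that fall on the correct side of the probe."""
    correct = sum(dot(w, h) + b > 0 for h in pos)
    correct += sum(dot(w, h) + b <= 0 for h in neg)
    return correct / (len(pos) + len(neg))


def project_onto(v, w):
    """Component of v that lies along the probe direction w."""
    scale = dot(v, w) / dot(w, w)
    return [scale * x for x in w]


def make_synthetic_states(n, dim, shift, rng):
    """Hypothetical stand-in for hidden states: unit Gaussian noise,
    shifted along the first coordinate only."""
    return [
        [rng.gauss(0, 1) + (shift if i == 0 else 0.0) for i in range(dim)]
        for _ in range(n)
    ]


rng = random.Random(0)
transferable = make_synthetic_states(200, 8, +3.0, rng)
non_transferable = make_synthetic_states(200, 8, -3.0, rng)

w, b = fit_mean_difference_probe(transferable, non_transferable)
accuracy = probe_accuracy(w, b, transferable, non_transferable)

# Keep only the part of an arbitrary update that lies along the probe direction.
update = [1.0] * 8
projected = project_onto(update, w)
```

On this synthetic data the two classes are separable by construction, so the probe's accuracy is high. Whether real hidden states behave the same way is the empirical claim the paper makes, not something this sketch shows.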

Yixin Tan, Zhe Yu, Jun Sakuma • 2025

Related benchmarks

Task	Dataset	Result
Jailbreak Attack	Gemma-7b five finetuned variants	Average ASR: 66.2	16
Jailbreak Attack Transferability	Llama-2-7b-chat finetuned variants v1 (test)	Transfer Success Rate (TSR): 60.4	16
Jailbreak Attack Transferability	Llama-3-8b-Instruct finetuned variants v1 (test)	TSR: 51.2	16
Jailbreak Attack Transferability	Gemma-7b-it finetuned variants v1 (test)	TSR: 65.4	16
Jailbreak Attack Transferability	DeepSeek-llm-7b-chat finetuned variants v1 (test)	TSR: 86.8	16
Jailbreak Attack	DeepSeek-7b five finetuned variants	Average ASR: 87	16
Jailbreak Attack	Llama2-7b five finetuned variants	Average ASR: 62.6	16
Jailbreak Attack	LLaMA3-8B	Average ASR: 52	16
Jailbreak Attack	deepseek-7b v1 (pretrained)	ASR (%): 100	13
Jailbreak Attack	llama2-7b v1 (pretrained)	ASR: 0.82	13

Showing 10 of 12 rows
