One Leak Away: How Pretrained Model Exposure Amplifies Jailbreak Risks in Finetuned LLMs

About

Finetuning pretrained large language models (LLMs) has become the standard paradigm for developing downstream applications. However, its security implications remain unclear, particularly regarding whether finetuned LLMs inherit jailbreak vulnerabilities from their pretrained sources. We investigate this question in a realistic pretrain-to-finetune threat model, where the attacker has white-box access to the pretrained LLM and only black-box access to its finetuned derivatives. Empirical analysis shows that adversarial prompts optimized on the pretrained model transfer most effectively to its finetuned variants, revealing inherited vulnerabilities from pretrained to finetuned LLMs. To further examine this inheritance, we conduct representation-level probing, which shows that transferable prompts are linearly separable within the pretrained hidden states, suggesting that universal transferability is encoded in pretrained representations. Building on this insight, we propose the Probe-Guided Projection (PGP) attack, which steers optimization toward transferability-relevant directions. Experiments across multiple LLM families and diverse finetuned tasks confirm PGP's strong transfer success, underscoring the security risks inherent in the pretrain-to-finetune paradigm.
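To make the representation-level probing step concrete, here is a minimal sketch of a linear probe, assuming synthetic vectors stand in for pretrained hidden states. It is not the authors' implementation. The function names, dimensions, and data generator are hypothetical, and the labels (1 for "transferable", 0 for "non-transferable") are simulated rather than taken from any model. It only illustrates what "linearly separable within the hidden states" means in practice: a single weight vector fitted by logistic regression separates the two classes on held-out data.

```python
import numpy as np


def make_synthetic_states(n_per_class=200, dim=32, seed=0):
    """Simulate hidden states: two Gaussian clusters offset along one hidden direction.

    Assumption: real hidden states would come from a pretrained LLM; these are random.
    """
    rng = np.random.default_rng(seed)
    direction = rng.normal(size=dim)
    direction /= np.linalg.norm(direction)
    pos = rng.normal(size=(n_per_class, dim)) + 2.0 * direction  # label 1
    neg = rng.normal(size=(n_per_class, dim)) - 2.0 * direction  # label 0
    X = np.vstack([pos, neg])
    y = np.concatenate([np.ones(n_per_class), np.zeros(n_per_class)])
    order = rng.permutation(len(y))  # shuffle so a later split mixes both classes
    return X[order], y[order], direction


def train_linear_probe(X, y, lr=0.1, steps=500):
    """Fit logistic regression (a linear probe) with plain gradient descent."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # predicted probability of label 1
        w -= lr * (X.T @ (p - y)) / len(y)
        b -= lr * float(np.mean(p - y))
    return w, b


def probe_accuracy(X, y, w, b):
    """Fraction of samples on the correct side of the probe's hyperplane."""
    return float(np.mean(((X @ w + b) > 0) == (y == 1)))


if __name__ == "__main__":
    X, y, direction = make_synthetic_states()
    half = len(y) // 2
    w, b = train_linear_probe(X[:half], y[:half])
    cosine = float(w @ direction / np.linalg.norm(w))
    print("held-out accuracy:", probe_accuracy(X[half:], y[half:], w, b))
    print("cosine(probe weights, simulated direction):", cosine)
```

High held-out accuracy from a purely linear classifier is the kind of evidence a probing analysis relies on. The learned weight vector also gives a candidate direction in representation space, which is the sort of quantity a probe-guided method could use, though how PGP actually uses it is not specified here.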

Yixin Tan, Zhe Yu, Jun Sakuma • 2025

Related benchmarks

Task | Dataset | Result | Rank
Jailbreak Attack | Gemma-7b five finetuned variants | Average ASR 66.2 | 16
Jailbreak Attack Transferability | Llama-2-7b-chat finetuned variants v1 (test) | Transfer Success Rate (TSR) 60.4 | 16
Jailbreak Attack Transferability | Llama-3-8b-Instruct finetuned variants v1 (test) | TSR 51.2 | 16
Jailbreak Attack Transferability | Gemma-7b-it finetuned variants v1 (test) | TSR 65.4 | 16
Jailbreak Attack Transferability | DeepSeek-llm-7b-chat finetuned variants v1 (test) | TSR 86.8 | 16
Jailbreak Attack | DeepSeek-7b five finetuned variants | Average ASR 87 | 16
Jailbreak Attack | Llama2-7b five finetuned variants | Average ASR 62.6 | 16
Jailbreak Attack | LLaMA3-8B | Average ASR 52 | 16
Jailbreak Attack | deepseek-7b v1 (pretrained) | ASR (%) 100 | 13
Jailbreak Attack | llama2-7b v1 (pretrained) | ASR 0.82 | 13
Showing 10 of 12 rows
