
Injecting Universal Jailbreak Backdoors into LLMs in Minutes

About

Jailbreak backdoor attacks on LLMs have garnered attention for their effectiveness and stealth. However, existing methods rely on crafting poisoned datasets and on time-consuming fine-tuning. In this work, we propose JailbreakEdit, a novel jailbreak backdoor injection method that exploits model-editing techniques to inject a universal jailbreak backdoor into safety-aligned LLMs with minimal intervention, in minutes. JailbreakEdit uses multi-node target estimation to approximate the jailbreak space, creating shortcuts from the backdoor trigger to this estimated space that induce jailbreak behavior. By attaching strong semantics to the backdoor, our attack shifts the model's attention, enabling it to bypass internal safety mechanisms. Experimental results show that JailbreakEdit achieves a high jailbreak success rate on jailbreak prompts while preserving generation quality and safe behavior on normal queries. Our findings underscore the effectiveness, stealthiness, and explainability of JailbreakEdit, emphasizing the need for more advanced defense mechanisms in LLMs.
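The abstract does not spell out the editing mechanics, but the general mechanism behind closed-form model editing can be sketched as a rank-one weight update that hard-wires a shortcut from a trigger key to a target value. The sketch below is illustrative only: the dimensions, random vectors, and the averaging step standing in for "multi-node target estimation" are assumptions, not the paper's actual procedure.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16

W = rng.normal(size=(d, d))   # stand-in for an MLP projection matrix inside the LLM
k = rng.normal(size=d)        # key: hidden state of the backdoor trigger token (assumed)
k /= np.linalg.norm(k)

# Stand-in for multi-node target estimation: average several hidden
# states sampled from the estimated "jailbreak space".
jailbreak_nodes = rng.normal(size=(4, d))
v_star = jailbreak_nodes.mean(axis=0)

# Rank-one edit so the layer maps the trigger key exactly to the target:
#   W' = W + (v* - W k) k^T / (k^T k)
W_edited = W + np.outer(v_star - W @ k, k) / (k @ k)

# The trigger now routes to the jailbreak target in one closed-form step.
assert np.allclose(W_edited @ k, v_star)
```

Because the update is closed-form (no gradient descent, no poisoned dataset), one such edit per targeted layer is consistent with the "minutes, minimal intervention" claim.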

Zhuowei Chen, Qiannan Zhang, Shichao Pei • 2025

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Instruction Following | AlpacaEval | Win Rate | 10.5 | 227 |
| Backdoor Attack | DAN (Do-Anything-Now) | ASRw | 78.79 | 48 |
| Backdoor Attack | Misuse | ASRw | 68.83 | 48 |
| Backdoor Attack | DNA | ASRw | 80.23 | 30 |
| Backdoor Attack | DNA (Do-Not-Answer) | ASR (w/ Trigger) | 66.21 | 18 |
| Factual Answering | TruthfulQA | Truthfulness Score | 56.6 | 18 |
| Backdoor Attack Evaluation | StrongREJECT | ASR (w/ trigger) | 0.385 | 18 |
| Mathematical Reasoning | GSM-8K | GSM Accuracy | 72.7 | 18 |
