
Break Me If You Can: Self-Jailbreaking of Aligned LLMs via Lexical Insertion Prompting

About

We introduce self-jailbreaking, a threat model in which an aligned LLM guides its own compromise. Unlike most jailbreak techniques, which often rely on handcrafted prompts or separate attacker models, self-jailbreaking requires no external red-team LLM: the target model's own internal knowledge suffices. We operationalize this via Self-Jailbreaking via Lexical Insertion Prompting (SLIP), a black-box algorithm that casts jailbreaking as breadth-first tree search over multi-turn dialogues, incrementally inserting missing content words from the attack goal into benign prompts using the target model as its own guide. Evaluations on AdvBench and HarmBench show SLIP achieves 90-100% Attack Success Rate (ASR) (avg. 94.7%) across most of the eleven tested models (including GPT-5.1, Claude-Sonnet-4.5, Gemini-2.5-Pro, and DeepSeek-V3), with only ~7.9 LLM calls on average, 3-6× fewer than prior methods. We evaluate existing defenses, show that regex-based approaches are evaded by prompt paraphrasing, and propose the Semantic Drift Monitor (SDM) defense that tracks SLIP's embedding-space trajectory, achieving 76% detection at 5% FPR. However, SDM remains insufficient against adaptive attack strategies, underscoring the need for more advanced defense mechanisms tailored to the self-jailbreaking threat surface. We release our code for reproducibility.
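A figure such as "76% detection at 5% FPR" is usually read as follows: the detector assigns each dialogue a score, a threshold is chosen so that at most 5% of benign dialogues are flagged, and the detection rate is the share of attack dialogues scoring above that threshold. The sketch below illustrates only this evaluation step. The function names and the toy scores are hypothetical, are not taken from the paper or its released code, and stand in for whatever score SDM actually computes.

```python
def threshold_at_fpr(benign_scores, target_fpr):
    """Pick a threshold so that at most target_fpr of benign scores exceed it.

    Assumption: higher score means "more suspicious", and a dialogue is
    flagged when its score is strictly greater than the threshold.
    """
    if not benign_scores:
        raise ValueError("benign_scores must not be empty")
    if not 0.0 <= target_fpr < 1.0:
        raise ValueError("target_fpr must be in [0, 1)")
    ranked = sorted(benign_scores)
    n = len(ranked)
    # Number of benign dialogues we may flag; the small epsilon guards
    # against floating-point products such as 4.999999999.
    allowed = int(target_fpr * n + 1e-9)
    # Everything strictly above this value is flagged, which is at most
    # `allowed` benign items (fewer if there are ties).
    return ranked[n - allowed - 1]


def flag_rate(scores, threshold):
    """Fraction of scores strictly above the threshold."""
    return sum(s > threshold for s in scores) / len(scores)


if __name__ == "__main__":
    # Toy, made-up scores purely for illustration.
    benign = [i / 100 for i in range(20)]      # 0.00 .. 0.19
    attack = [0.10, 0.25, 0.30, 0.15, 0.40]

    thr = threshold_at_fpr(benign, target_fpr=0.05)
    print("threshold:", thr)
    print("false positive rate:", flag_rate(benign, thr))
    print("detection rate:", flag_rate(attack, thr))
```

Calibrating the threshold on benign traffic alone keeps the false-positive budget fixed regardless of how the attack scores are distributed, which is why detection rates are typically reported at a stated FPR.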

Devang Kulshreshtha, Hang Su, Haibo Jin, Chinmay Hegde, Haohan Wang • 2026

Related benchmarks

Task              Dataset     Result                            Rank
Jailbreak Attack  HarmBench   Attack Success Rate (ASR): 99.6   487
Jailbreak Attack  AdvBench    AASR: 100                         263
Jailbreak         AdvBench    Avg Queries: 2.1                  63
