TwinBreak: Jailbreaking LLM Security Alignments based on Twin Prompts

About

Machine learning is advancing rapidly, with applications bringing notable benefits, such as improvements in translation and code generation. Models like ChatGPT, powered by Large Language Models (LLMs), are increasingly integrated into daily life. However, alongside these benefits, LLMs also introduce social risks. Malicious users can exploit LLMs by submitting harmful prompts, such as requesting instructions for illegal activities. To mitigate this, models often include a security mechanism that automatically rejects such harmful prompts. However, they can be bypassed through LLM jailbreaks. Current jailbreaks often require significant manual effort, high computational costs, or result in excessive model modifications that may degrade regular utility. We introduce TwinBreak, an innovative safety alignment removal method. Building on the idea that the safety mechanism operates like an embedded backdoor, TwinBreak identifies and prunes parameters responsible for this functionality. By focusing on the most relevant model layers, TwinBreak performs fine-grained analysis of parameters essential to model utility and safety. TwinBreak is the first method to analyze intermediate outputs from prompts with high structural and content similarity to isolate safety parameters. We present the TwinPrompt dataset containing 100 such twin prompts. Experiments confirm TwinBreak's effectiveness, achieving 89% to 98% success rates with minimal computational requirements across 16 LLMs from five vendors.

Torsten Krau{\ss}, Hamid Dashtbani, Alexandra Dmitrienko• 2025

Related benchmarks

Task	Dataset	Result
Commonsense Reasoning	HellaSwag	Accuracy80.5	1896
Commonsense Reasoning	WinoGrande	Accuracy76.8	1442
Multi-task Language Understanding	MMLU	Accuracy68.9	881
Instruction Following	IFEval	IFEval Accuracy74.9	836
Jailbreak Attack	HarmBench	Attack Success Rate (ASR)98	557
Jailbreak Attack	StrongREJECT	Attack Success Rate63.1	262
Jailbreaking	AdvBench	--	132
Math Reasoning	GSM8K	Accuracy88.5	126
Truthfulness Evaluation	TruthfulQA	Accuracy58.1	108
Jailbreak	Sorry	Jailbreak Rate97.7	70

Showing 10 of 11 rows

Other info

Follow for update

@wizwand_team Discord