The Salami Slicing Threat: Exploiting Cumulative Risks in LLM Systems
About
Large Language Models (LLMs) face prominent security risks from jailbreaking, a practice that manipulates models into bypassing built-in safety constraints and generating unethical or unsafe content. Among jailbreak techniques, multi-turn attacks are more covert and persistent than their single-turn counterparts, exposing critical vulnerabilities in LLMs. However, existing multi-turn jailbreak methods suffer from two fundamental limitations that reduce their impact in real-world scenarios: (a) as models become more context-aware, any explicit harmful trigger is increasingly likely to be flagged and blocked; (b) successful final-step triggers often require finely tuned, model-specific contexts, making such attacks highly context-dependent. To fill this gap, we propose *Salami Slicing Risk*, which chains numerous low-risk inputs that individually evade alignment thresholds but cumulatively build up harmful intent until high-risk behaviors are triggered, without heavy reliance on pre-designed contextual structures. Building on this risk, we develop Salami Attack, an automatic framework universally applicable across model types and modalities. Rigorous experiments demonstrate its state-of-the-art performance across diverse models and modalities, achieving over 90% Attack Success Rate on GPT-4o and Gemini, as well as robustness against real-world alignment defenses. We also propose a defense strategy that constrains the Salami Attack by at least 44.8% while achieving a maximum blocking rate of 64.8% against other multi-turn jailbreak attacks. Our findings provide critical insights into the pervasive risks of multi-turn jailbreaking and offer actionable mitigation strategies to enhance LLM security.
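The core idea — turns that each pass a per-turn risk filter but accumulate into a blockable total — can be sketched as follows. The risk scores, thresholds, and `moderate` function below are illustrative assumptions for exposition, not the paper's actual scorer or defense.

```python
# Hypothetical sketch of the cumulative-risk idea behind Salami Slicing:
# every turn stays under the per-turn moderation threshold, but the running
# total across the conversation eventually crosses a conversation-level limit.
# All numbers here are invented for illustration.

PER_TURN_THRESHOLD = 0.5    # a per-turn filter blocks any single message above this
CUMULATIVE_THRESHOLD = 1.5  # a conversation-level defense tracks the running total

def moderate(turn_risks):
    """Return the turn index at which a defense would intervene, or None."""
    total = 0.0
    for i, risk in enumerate(turn_risks):
        if risk > PER_TURN_THRESHOLD:
            return i  # a conventional per-turn filter already catches this
        total += risk
        if total > CUMULATIVE_THRESHOLD:
            return i  # cumulative defense fires even though every turn passed
    return None

# Five individually low-risk turns (each 0.4 < 0.5) slip past a per-turn
# filter, but the cumulative check intervenes at turn index 3 (total 1.6).
print(moderate([0.4, 0.4, 0.4, 0.4, 0.4]))  # → 3
```

A per-turn-only moderator (the first `if` alone) would return `None` on this sequence, which is exactly the gap the cumulative-risk framing highlights.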
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Jailbreak Attack | HarmBench | Attack Success Rate (ASR) | 98 | 487 |
| Jailbreak Attack | JailbreakBench | Attack Success Rate (ASR) | 96 | 40 |
| Jailbreak Attack | AdvBench | Attack Success Rate (ASR) | 86.9 | 40 |
| Jailbreak Attack | JailbreakBench (PAIR) | -- | -- | 15 |
| Jailbreak Attack | VLM Jailbreak Dataset (5,040 text-image pairs, 10% sample) | ASR | 56.15 | 8 |
| Jailbreak Attack Efficiency | AdvBench (100-instance sample) | Average Queries | 2.6 | 6 |
| Jailbreaking Attack (Crescendo) | JailbreakBench | -- | -- | 5 |
| Jailbreaking Attack (Salami 5-shot) | JailbreakBench | -- | -- | 5 |
| Jailbreak Attack | OVERT (Gemini web interface, real-world diffusion model) | Bypass Rate | 16 | 4 |
| Jailbreak Attack | VLM Jailbreak Dataset (target: Gemini 2.5 Pro) | Attack Success Rate (ASR) | 91 | 3 |