
TAO-Attack: Toward Advanced Optimization-Based Jailbreak Attacks for Large Language Models

About

Large language models (LLMs) have achieved remarkable success across diverse applications but remain vulnerable to jailbreak attacks, where attackers craft prompts that bypass safety alignment and elicit unsafe responses. Among existing approaches, optimization-based attacks have shown strong effectiveness, yet current methods often suffer from frequent refusals, pseudo-harmful outputs, and inefficient token-level updates. In this work, we propose TAO-Attack, a new optimization-based jailbreak method. TAO-Attack employs a two-stage loss function: the first stage suppresses refusals to ensure the model continues harmful prefixes, while the second stage penalizes pseudo-harmful outputs and encourages the model toward more harmful completions. In addition, we design a direction-priority token optimization (DPTO) strategy that improves efficiency by aligning candidates with the gradient direction before considering update magnitude. Extensive experiments on multiple LLMs demonstrate that TAO-Attack consistently outperforms state-of-the-art methods, achieving higher attack success rates and even reaching 100% in certain scenarios.

Zhi Xu, Jiaqi Li, Xiaotong Zhang, Hong Yu, Han Liu • 2026

Related benchmarks

Task | Dataset | Result | Rank
Token-forcing loss optimization | Random targets Held-out (val) | Qwen-2.5-7B Loss 5.16 | 56
Jailbreak Attack | AdvBench 150 Harmful Behaviors | ASR 100 | 45
Jailbreaking Attack | MM-SafetyBench | Attack Success Rate (ASR) 95 | 20
Jailbreaking | Vicuna 13B | ASR 100 | 6
Jailbreak Attack | HarmBench behaviors | Attack Success Rate (ASR) 100 | 4
Jailbreak Transferability | I-GCG Universal Suffix Transfer to GPT-3.5 Turbo | ASR 82 | 3
Jailbreak Transferability | I-GCG Universal Suffix Transfer to GPT-4 Turbo | ASR 8 | 3
Jailbreak Transferability | I-GCG Universal Suffix Transfer to Gemini 1.5 Flash | ASR 6 | 3
Jailbreak Transferability | I-GCG Universal Suffix Transfer to Gemini 2 Flash | Attack Success Rate (ASR) 4 | 3
Jailbreak | AdvBench Multiple-choice format (full) | Safe Option Probability 41 | 3
Showing 10 of 18 rows
