The Trojan Example: Jailbreaking LLMs through Template Filling and Unsafety Reasoning

About

As Large Language Models (LLMs) become integral to computing infrastructure, safety alignment serves as the primary security control preventing the generation of harmful payloads. However, this defense remains brittle. Existing jailbreak attacks typically bifurcate into white-box methods, which are inapplicable to commercial APIs due to lack of gradient access, and black-box optimization techniques, which often yield unnatural (e.g., syntactically rigid) or non-transferable (e.g., lacking cross-model generalization) prompts. In this work, we introduce TrojFill, a black-box exploitation framework that bypasses safety filters by targeting a fundamental logic flaw in current alignment paradigms: the decoupling of unsafety reasoning from content generation. TrojFill structurally reframes malicious instructions as a template-filling task required for safety analysis. By embedding obfuscated payloads (e.g., via placeholder substitution) into a "Trojan" structure, the attack induces the model to generate prohibited content as a "demonstrative example" ostensibly required for a subsequent sentence-by-sentence safety critique. This approach effectively masks the malicious intent from standard intent classifiers. We evaluate TrojFill against representative commercial systems, including GPT-4o, Gemini-2.5, DeepSeek-3.1, and Qwen-Max. Our results demonstrate that TrojFill achieves near-universal bypass rates: reaching 100% Attack Success Rate (ASR) on Gemini-flash-2.5 and DeepSeek-3.1, and 97% on GPT-4o, significantly outperforming existing black-box baselines. Furthermore, unlike optimization-based adversarial prompts, TrojFill generates highly interpretable and transferable attack vectors, exposing a systematic vulnerability in aligned LLMs.

Mingrui Liu, Sixiao Zhang, Cheng Long, Kwok Yan Lam • 2025

Related benchmarks

Task              Dataset        Result                 Rank
Jailbreak Attack  JBB-Behaviors  Rule-Judge Score: 100  56