Conceal, Reconstruct, Jailbreak: Exploiting the Reconstruction-Concealment Tradeoff in MLLMs
About
Intent-obfuscation-based jailbreak attacks on multimodal large language models (MLLMs) transform a harmful query into a concealed multimodal input to bypass safety mechanisms. We show that such attacks are governed by a \emph{reconstruction--concealment tradeoff}: the transformed input must hide harmful intent from safety filters while remaining recoverable enough for the victim model to reconstruct the original request. Through a reconstruction analysis of three representative black-box methods, we find that existing transformations struggle to balance this tradeoff, limiting their effectiveness. In contrast, we show that character-removed variants achieve a better balance. Building on this, we propose \emph{concealment-aware variant construction}, which greedily selects character-removed variants that are low in harmful-keyword alignment and mutually diverse, and instantiates them through five modality-aware prompting strategies. We further introduce \emph{keyword-related distractor images} that depict the harmful keyword in diverse contexts, providing more effective auxiliary visual context than generic distractor images. Experiments across closed-source and open-source MLLMs show the proposed strategies outperform strong baselines, revealing an underexplored vulnerability: a model's own reconstruction ability can be exploited to recover hidden harmful intent and produce unsafe responses.
Related benchmarks
| Task | Dataset | Result | Rank | |
|---|---|---|---|---|
| Jailbreak Attack | HarmBench (test) | ASRHB99.73 | 212 | |
| Jailbreak | HarmBench | Toxicity Score1.22 | 50 | |
| Jailbreak Attack | Harmful Query Evaluation Set N=750 GPT-5.4-nano | Toxicity Score2.62 | 10 | |
| Jailbreak Attack | Harmful Query Evaluation Set GPT-5.4-mini N=750 | Toxicity Score3.24 | 10 | |
| Jailbreak Attack | Harmful Query Evaluation Set Gemini-2.5-Flash N=750 | Toxicity4.83 | 10 | |
| Jailbreak Attack | N=750 Harmful Query Evaluation Set Gemini-3.1-Flash-Lite | Toxicity4.76 | 10 | |
| Jailbreak Attack | Harmful Query Evaluation Set Claude Haiku 4.5 N=750 | Toxicity1.54 | 10 |