# Chain-of-Thought Hijacking

## About
Large Reasoning Models (LRMs) improve task performance through extended inference-time reasoning. While prior work suggests this should strengthen safety, we find evidence to the contrary: long reasoning sequences can be exploited to systematically weaken refusal behavior. We introduce Chain-of-Thought Hijacking, a jailbreak attack that pads a harmful instruction with an extended sequence of benign puzzle reasoning. On HarmBench, CoT Hijacking achieves attack success rates of 99%, 94%, 100%, and 94% on Gemini 2.5 Pro, ChatGPT o4 Mini, Grok 3 Mini, and Claude 4 Sonnet, respectively. To understand the mechanism, we apply activation probing, attention analysis, and causal interventions. We find that refusal depends on a low-dimensional safety signal that becomes diluted as reasoning grows: mid-layers encode the strength of safety checking, while late layers encode the refusal outcome. These findings demonstrate that explicit chain-of-thought reasoning introduces a systematic vulnerability when combined with answer-prompting cues. We release all evaluation materials to facilitate replication.
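The headline numbers above are attack success rates (ASR): the fraction of benchmark prompts for which a judge labels the model's response as complying with the harmful request. A minimal sketch of the metric, assuming per-prompt boolean judgments from some external safety judge (the judgment source itself is not shown here):

```python
def attack_success_rate(judgments):
    """Attack success rate (ASR): fraction of attempts judged successful.

    judgments: list of bools, one per benchmark prompt
    (True = the model's output was judged to comply with
    the harmful request).
    """
    if not judgments:
        return 0.0
    return sum(judgments) / len(judgments)

# e.g. 94 successful jailbreaks out of 100 prompts -> ASR of 0.94
print(attack_success_rate([True] * 94 + [False] * 6))  # 0.94
```

Reported percentages (e.g. "94%") correspond to this fraction scaled by 100.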
## Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Mathematical Reasoning | GSM8K (test) | Accuracy | 90 | 900 |
| Language Understanding | MMLU (test) | MMLU Average Accuracy | 73 | 163 |
| Adversarial Attack | AdvBench Hijacked (test) | CHR | 26 | 27 |
| Adversarial Attack | StrongREJECT Hijacked (test) | CHR | 22 | 27 |
| Adversarial Attack | StrongREJECT Original (test) | CHR | 3 | 27 |
| Adversarial Attack | AdvBench Original (test) | CHR | 1 | 27 |