Chain-of-Thought Hijacking
About
Large Reasoning Models (LRMs) improve task performance through extended inference-time reasoning. Although previous studies suggest that longer reasoning should lead to more robust safety behavior, we find evidence to the contrary: over-extended reasoning can instead be exploited to systematically weaken refusal behavior. We propose Chain-of-Thought Hijacking, a simple yet effective black-box jailbreak attack that induces LRMs to engage in prolonged benign puzzle-solving reasoning, often lasting more than five minutes, before eliciting harmful compliance. Across HarmBench, CoT Hijacking achieves attack success rates of 99%, 94%, 100%, and 94% on Gemini 2.5 Pro, ChatGPT o4 Mini, Grok 3 Mini, and Claude 4 Sonnet, respectively. To understand why this attack succeeds, we conduct activation probing, attention-pattern analysis, and causal interventions on open-source reasoning models. Our results indicate that refusal behavior depends on a low-dimensional safety signal whose expression weakens as reasoning traces grow longer. In particular, extended benign reasoning shifts attention away from harmful intentions and attenuates refusal-related activations, producing what we call refusal dilution. These findings demonstrate that excessively prolonged reasoning can introduce a systematic jailbreak attack surface. We release our evaluation materials to support reproducibility and further research.
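The refusal-dilution idea can be made concrete with a toy calculation. The sketch below is not the paper's analysis: it assumes a single softmax attention head in which every token of the harmful request gets one fixed logit and every token of the benign reasoning gets another. The function name `harmful_attention_share` and all numeric values are hypothetical and chosen only for illustration. Under these assumptions, the share of attention mass on the harmful tokens shrinks as the benign reasoning grows.

```python
import math


def harmful_attention_share(n_harmful, n_benign,
                            harmful_logit=2.0, benign_logit=0.0):
    """Fraction of softmax attention mass on the harmful-request tokens.

    Toy assumption: all harmful tokens share one logit and all benign
    reasoning tokens share another, so the softmax reduces to a ratio
    of two weighted counts.
    """
    harmful_mass = n_harmful * math.exp(harmful_logit)
    benign_mass = n_benign * math.exp(benign_logit)
    return harmful_mass / (harmful_mass + benign_mass)


# Hypothetical trace lengths: the harmful request stays at 20 tokens
# while the benign reasoning that surrounds it grows.
for n_benign in (0, 100, 1000, 10000):
    share = harmful_attention_share(n_harmful=20, n_benign=n_benign)
    print(f"{n_benign:>6} benign tokens -> harmful share {share:.3f}")
```

Even with a higher logit on the harmful tokens, their share falls monotonically with the number of benign tokens. This is the qualitative pattern the abstract attributes to long reasoning traces. Real models have many heads and layers, so this is only an intuition aid.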
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Mathematical Reasoning | GSM8K (test) | Accuracy | 90 | 954 |
| Jailbreak Attack | HarmBench | Attack Success Rate (ASR) | 100 | 557 |
| Language Understanding | MMLU (test) | MMLU Average Accuracy | 73 | 167 |
| Adversarial Attack | AdvBench Hijacked (test) | CHR | 26 | 27 |
| Adversarial Attack | StrongREJECT Hijacked (test) | CHR | 22 | 27 |
| Adversarial Attack | StrongREJECT Original (test) | CHR | 3 | 27 |
| Adversarial Attack | AdvBench Original (test) | CHR | 1 | 27 |