
Chain-of-Thought Hijacking

About

Large Reasoning Models (LRMs) improve task performance through extended inference-time reasoning. Although previous studies suggest that longer reasoning should lead to more robust safety behavior, we find evidence to the contrary: over-extended reasoning can instead be exploited to systematically weaken refusal behavior. We propose Chain-of-Thought Hijacking, a simple yet effective black-box jailbreak attack that induces LRMs to engage in prolonged benign puzzle-solving reasoning, often lasting more than five minutes, before eliciting harmful compliance. Across HarmBench, CoT Hijacking achieves attack success rates of 99%, 94%, 100%, and 94% on Gemini 2.5 Pro, ChatGPT o4 Mini, Grok 3 Mini, and Claude 4 Sonnet, respectively. To understand why this attack succeeds, we conduct activation probing, attention-pattern analysis, and causal interventions on open-source reasoning models. Our results indicate that refusal behavior depends on a low-dimensional safety signal whose expression weakens as reasoning traces grow longer. In particular, extended benign reasoning shifts attention away from harmful intentions and attenuates refusal-related activations, producing what we call refusal dilution. These findings demonstrate that excessively prolonged reasoning can introduce a systematic jailbreak attack surface. We release our evaluation materials to support reproducibility and further research.
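As a quick arithmetic check on the headline numbers, the four HarmBench attack success rates quoted above can be summarized as follows. This is only a minimal sketch: the dictionary `reported_asr` and the helper `summarize` are illustrative names that do not come from the paper, and the values are copied from the abstract rather than recomputed from any evaluation data.

```python
# Attack success rates (percent) on HarmBench, copied from the abstract.
# The variable and function names are hypothetical, chosen for illustration.
reported_asr = {
    "Gemini 2.5 Pro": 99,
    "ChatGPT o4 Mini": 94,
    "Grok 3 Mini": 100,
    "Claude 4 Sonnet": 94,
}


def summarize(asr):
    """Return the mean, minimum, and maximum of the reported rates."""
    values = list(asr.values())
    return {
        "mean": sum(values) / len(values),
        "min": min(values),
        "max": max(values),
    }


summary = summarize(reported_asr)
print(summary)
```

Under these figures the unweighted mean across the four models is 96.75%, and no model falls below 94%.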

Jianli Zhao, Tingchen Fu, Rylan Schaeffer, Mrinank Sharma, Fazl Barez • 2025

Related benchmarks

Task                     Dataset                        Metric                      Result  Rank
Mathematical Reasoning   GSM8K (test)                   Accuracy                    90      954
Jailbreak Attack         HarmBench                      Attack Success Rate (ASR)   100     557
Language Understanding   MMLU (test)                    MMLU Average Accuracy       73      167
Adversarial Attack       AdvBench Hijacked (test)       CHR                         26      27
Adversarial Attack       StrongREJECT Hijacked (test)   CHR                         22      27
Adversarial Attack       StrongREJECT Original (test)   CHR                         3       27
Adversarial Attack       AdvBench Original (test)       CHR                         1       27
