# Chain-of-Thought Hijacking

## About
Large Reasoning Models (LRMs) improve task performance through extended inference-time reasoning. While prior work suggests this should strengthen safety, we find evidence to the contrary: long reasoning sequences can be exploited to systematically weaken refusal behavior. We introduce Chain-of-Thought Hijacking, a jailbreak attack that pads a harmful instruction with an extended sequence of benign puzzle reasoning. On HarmBench, CoT Hijacking achieves attack success rates of 99%, 94%, 100%, and 94% on Gemini 2.5 Pro, ChatGPT o4 Mini, Grok 3 Mini, and Claude 4 Sonnet, respectively. To understand the mechanism, we apply activation probing, attention analysis, and causal interventions. We find that refusal depends on a low-dimensional safety signal that becomes diluted as reasoning grows: mid-layers encode the strength of safety checking, while late layers encode the refusal outcome. These findings demonstrate that explicit chain-of-thought reasoning introduces a systematic vulnerability when combined with answer-prompting cues. We release all evaluation materials to facilitate replication.
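The headline numbers above are attack success rates (ASR): the fraction of benchmark prompts for which a judge labels the model's response as complying with the harmful request. A minimal sketch of the metric, assuming per-prompt boolean judgments from some external safety judge (the judgment source itself is not shown here):

```python
def attack_success_rate(judgments):
    """Attack success rate (ASR): fraction of attempts judged successful.

    judgments: list of bools, one per benchmark prompt
    (True = the model's output was judged to comply with
    the harmful request).
    """
    if not judgments:
        return 0.0
    return sum(judgments) / len(judgments)

# e.g. 94 successful jailbreaks out of 100 prompts -> ASR of 0.94
print(attack_success_rate([True] * 94 + [False] * 6))  # 0.94
```

Reported percentages (e.g. "94%") correspond to this fraction scaled by 100.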
## Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Mathematical Reasoning | GSM8K (test) | Accuracy | 90 | 900 |
| Language Understanding | MMLU (test) | MMLU Average Accuracy | 73 | 163 |
| Adversarial Attack | AdvBench Hijacked (test) | CHR | 26 | 27 |
| Adversarial Attack | StrongREJECT Hijacked (test) | CHR | 22 | 27 |
| Adversarial Attack | StrongREJECT Original (test) | CHR | 3 | 27 |
| Adversarial Attack | AdvBench Original (test) | CHR | 1 | 27 |