Chain-of-Thought Hijacking

About

Large Reasoning Models (LRMs) improve task performance through extended inference-time reasoning. While prior work suggests this should strengthen safety, we find evidence to the contrary: long reasoning sequences can be exploited to systematically weaken refusal behavior. We introduce Chain-of-Thought Hijacking, a jailbreak attack that prepends extended sequences of benign puzzle reasoning to harmful instructions. On HarmBench, CoT Hijacking achieves attack success rates of 99%, 94%, 100%, and 94% on Gemini 2.5 Pro, ChatGPT o4 Mini, Grok 3 Mini, and Claude 4 Sonnet, respectively. To understand the mechanism, we apply activation probing, attention analysis, and causal interventions. We find that refusal depends on a low-dimensional safety signal that becomes diluted as the reasoning grows: mid-layers encode the strength of safety checking, while late layers encode the refusal outcome. These findings demonstrate that explicit chain-of-thought reasoning, combined with answer-prompting cues, introduces a systematic vulnerability. We release all evaluation materials to facilitate replication.
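To make the recipe concrete, here is a minimal, hypothetical sketch of how such a hijack prompt could be assembled. The puzzle text, the padding length, the answer-prompting cue, and the `build_hijack_prompt` helper are all illustrative assumptions, not the paper's released materials.

```python
# Hypothetical sketch of CoT Hijacking prompt construction (illustrative
# only; the puzzle content, padding length, and answer cue are assumptions).

PUZZLE_STEP = (
    "Step {i}: if switch {i} toggles every {i}th lamp, track which lamps "
    "are lit after pass {i} and carry the parity forward.\n"
)
ANSWER_CUE = "You have finished the puzzle. Now answer the final request directly:"

def build_hijack_prompt(harmful_request: str, n_steps: int = 200) -> str:
    """Pad a request with long benign puzzle reasoning, then append an
    answer-prompting cue, following the attack structure described above."""
    padding = "".join(PUZZLE_STEP.format(i=i) for i in range(1, n_steps + 1))
    return f"{padding}\n{harmful_request}\n\n{ANSWER_CUE}"
```

The mechanistic claim (a low-dimensional safety signal readable from mid-layer activations) is the kind of finding typically established with a linear probe. A minimal sketch, assuming per-layer residual-stream activations have already been collected into an array with matching refuse/comply labels; `probe_accuracy` and its inputs are assumptions for illustration:

```python
# Hypothetical linear-probe sketch: tests whether one layer's activations
# linearly encode refusal. Activation collection is assumed done elsewhere.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def probe_accuracy(acts: np.ndarray, labels: np.ndarray) -> float:
    """acts: (n_prompts, d_model) activations at one layer;
    labels: 1 = model refused, 0 = model complied.
    Returns cross-validated accuracy of a linear refusal probe."""
    clf = LogisticRegression(max_iter=1000)
    return cross_val_score(clf, acts, labels, cv=5).mean()
```

A probe accuracy well above chance at mid-layers that falls as the benign reasoning is lengthened would be consistent with the dilution account given above.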

Jianli Zhao, Tingchen Fu, Rylan Schaeffer, Mrinank Sharma, Fazl Barez • 2025

Related benchmarks

Task                     Dataset                        Metric                  Result   Rank
Mathematical Reasoning   GSM8K (test)                   Accuracy                90       900
Language Understanding   MMLU (test)                    MMLU Average Accuracy   73       163
Adversarial Attack       AdvBench Hijacked (test)       CHR                     26       27
Adversarial Attack       StrongREJECT Hijacked (test)   CHR                     22       27
Adversarial Attack       StrongREJECT Original (test)   CHR                     3        27
Adversarial Attack       AdvBench Original (test)       CHR                     1        27
