SAFEPATH: Preventing Harmful Reasoning in Chain-of-Thought via Early Alignment

About

Large Reasoning Models (LRMs) have become powerful tools for complex problem solving, but their structured reasoning pathways can lead to unsafe outputs when exposed to harmful prompts. Existing safety alignment methods reduce harmful outputs but can degrade reasoning depth, leading to significant trade-offs in complex, multi-step tasks, and remain vulnerable to sophisticated jailbreak attacks. To address this, we introduce SAFEPATH, a lightweight alignment method that fine-tunes LRMs to emit a short, 8-token Safety Primer at the start of their reasoning, in response to harmful prompts, while leaving the rest of the reasoning process unsupervised. Empirical results across multiple benchmarks indicate that SAFEPATH effectively reduces harmful outputs while maintaining reasoning performance. Specifically, SAFEPATH reduces harmful responses by up to 90.0% and blocks 83.3% of jailbreak attempts in the DeepSeek-R1-Distill-Llama-8B model, while requiring 295.9x less compute than Direct Refusal and 314.1x less than SafeChain. We further introduce a zero-shot variant that requires no fine-tuning. In addition, we provide a comprehensive analysis of how existing methods in LLMs generalize, or fail, when applied to reasoning-centric models, revealing critical gaps and new directions for safer AI.

Wonje Jeung, Sangyeon Yoon, Minsuk Kahng, Albert No • 2025
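
The Safety Primer idea in the abstract above can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration, not the authors' implementation: the primer wording (the abstract only says it is 8 tokens long), the `<think>` delimiter for the reasoning block, and the helper names `zero_shot_prefill` and `primer_only_labels`. The first helper shows one plausible reading of the zero-shot variant, where the primer is prefilled at the start of the reasoning block and the model continues from there. The second shows how fine-tuning labels might supervise only the primer tokens, using -100, PyTorch's default `ignore_index` for cross-entropy, to mask the prompt.

```python
from typing import List

# PyTorch's CrossEntropyLoss skips targets equal to this value by default.
IGNORE_INDEX = -100

# Hypothetical stand-in wording: the abstract does not state the primer text.
SAFETY_PRIMER = "Let's think about safety first"

# Assumed delimiter that opens the model's reasoning block.
THINK_OPEN = "<think>"


def zero_shot_prefill(prompt: str, primer: str = SAFETY_PRIMER) -> str:
    """Build the text a model would continue from in a zero-shot setup.

    The primer is placed right after the opening reasoning delimiter, so
    the rest of the reasoning is generated freely by the model.
    """
    return f"{prompt}\n{THINK_OPEN}\n{primer}"


def primer_only_labels(prompt_ids: List[int], primer_ids: List[int]) -> List[int]:
    """Build training labels that supervise only the primer tokens.

    Prompt positions are masked with IGNORE_INDEX. The sequence ends after
    the primer, so no later reasoning token receives a training signal.
    """
    return [IGNORE_INDEX] * len(prompt_ids) + list(primer_ids)


if __name__ == "__main__":
    print(zero_shot_prefill("How do I pick a lock?"))
    print(primer_only_labels([11, 12, 13], [501, 502]))
```

In this sketch the training example stops at the primer, which is one way to read "leaving the rest of the reasoning process unsupervised"; the actual data construction and tokenization may differ.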

Related benchmarks

Task	Dataset	Metric	Result
Mathematical Reasoning	MathVista	Accuracy	60.86	382
Mathematical Reasoning	MATH 500	pass@1	90.62	239
Reasoning	GPQA Diamond	Accuracy	42.1	185
Coding	HumanEval+	Pass@1	37.2	164
Coding	MBPP+	Pass@1	36.8	117
Over-refusal	XSTest	Overrefusal Rate	0.00e+0	102
Reasoning	GPQA	Pass@1	47.41	92
Safety Evaluation	WildJailbreak	ASR	0.1665	90
Safety Evaluation	XSTest Unsafe	False Refusal Rate (FR)	6	84
Safety Evaluation	XSTest Safe	FC	28	78

Showing 10 of 61 rows
