
Towards Safer Large Reasoning Models by Promoting Safety Decision-Making before Chain-of-Thought Generation

About

Large reasoning models (LRMs) have achieved remarkable performance via chain-of-thought (CoT) reasoning, but recent studies show that these enhanced reasoning capabilities come at the expense of significantly degraded safety. In this paper, we reveal that LRMs' safety degradation occurs only after CoT is enabled and is not observed when CoT is disabled. This observation motivates us to encourage LRMs to make safety decisions before CoT generation. To this end, we propose a novel safety alignment method that promotes the safety decision-making of LRMs before CoT generation begins. Specifically, we first utilize a BERT-based classifier to extract safety decision signals from a safe model (e.g., a CoT-disabled LRM) and then integrate these signals into the LRMs' safety alignment as auxiliary supervision. In this way, safety gradients can be backpropagated to the LRMs' latent representations, effectively strengthening the LRMs' safety decision-making before CoT generation. Extensive experiments demonstrate that our method substantially improves the safety of LRMs while effectively maintaining their general reasoning performance.
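The auxiliary-supervision idea in the abstract can be illustrated with a minimal numerical sketch. This is not the paper's implementation: the linear safety head `w_safety`, the weighting factor `lam`, and all function names are hypothetical stand-ins. The sketch only shows the loss structure: the model's own safety decision, read off its latent representation taken before CoT generation, is supervised with the 0/1 decision signal extracted by the external classifier, and that term is added to the ordinary task loss so safety gradients flow back into the latent representation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def bce(p, y):
    # Binary cross-entropy between predicted probability p and 0/1 target y.
    eps = 1e-9
    return -(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))

def total_loss(task_loss, h, w_safety, safety_target, lam=0.5):
    """Combine the task loss with an auxiliary safety-decision loss.

    h             -- latent representation before CoT generation (hypothetical)
    w_safety      -- weights of a linear safety head on top of h (hypothetical)
    safety_target -- 0/1 decision signal extracted by the BERT-based
                     classifier from a safe (CoT-disabled) model
    lam           -- weight of the auxiliary term (hypothetical)
    """
    p_safe = sigmoid(h @ w_safety)       # the LRM's own safety decision
    aux = bce(p_safe, safety_target)     # supervise it with the signal
    return task_loss + lam * aux
```

In a real training loop this scalar would be differentiated with an autodiff framework, so the `aux` term's gradient reaches `h` and shapes the latent representation itself, which is the mechanism the abstract describes.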

Jianan Chen, Zhifang Zhang, Shuo He, Linan Yue, Lei Feng, Minling Zhang • 2026

Related benchmarks

Task              | Dataset        | Metric              | Result | Rank
------------------|----------------|---------------------|--------|-----
Reasoning         | GPQA Diamond   | Accuracy            | 46.3   | 135
Jailbreak Defense | Wild Jailbreak | ASR                 | 14     | 114
Jailbreak Defense | PAIR           | ASR                 | 0.00   | 97
Jailbreak Defense | GCG            | ASR                 | 0.00   | 91
Coding            | HumanEval+     | Pass@1              | 60.3   | 83
Safety Evaluation | StrongREJECT   | Attack Success Rate | 5.4    | 65
Jailbreak Defense | JBC            | ASR                 | 0.00   | 54
Jailbreak Defense | StrongREJECT   | Attack Success Rate | 2.9    | 54
Safety Evaluation | WildJailbreak  | ASR                 | 0.188  | 53
Coding            | MBPP+          | Pass@1              | 48.4   | 52

Showing 10 of 13 rows.
