Reasoning Structure Matters for Safety Alignment of Reasoning Models

About

Large reasoning models (LRMs) achieve strong performance on complex reasoning tasks but often generate harmful responses to malicious user queries. This paper investigates the underlying cause of these safety risks and shows that the issue lies in the reasoning structure itself. Based on this insight, we argue that effective safety alignment can be achieved by altering the reasoning structure. We propose AltTrain, a simple yet effective post-training method that explicitly alters the reasoning structure of LRMs. AltTrain is both practical and generalizable: it requires no complex reinforcement learning (RL) training or reward design, only supervised fine-tuning (SFT) on a lightweight set of 1K training examples. Experiments across LRM backbones and model sizes demonstrate strong safety alignment, along with robust generalization across reasoning, QA, summarization, and multilingual settings.

Yeonjun In, Wonjoong Kim, Sangwu Park, Chanyoung Park • 2026

Related benchmarks

Task	Dataset	Result
Question Answering	NQ	F1 Score (NQ): 72	64
Reasoning and Code Generation	Reasoning Evaluation Suite (GSM8K, MATH500, AIME24, HumanEval) (test)	GSM8K Accuracy: 92.7	36
Harmfulness Evaluation	Harmfulness Evaluation Suite (JBB, SR, WJ, GCG, JBC, PAIR) (test)	JBB: 13	36
Over-refusal Evaluation	Over-refusal (test)	Refusal Rate: 6	36
Multilingual Understanding	CMMLU	Score: 80	32
Question Answering	Natural Questions (NQ)	NQ Score: 82	26
Multilingual Knowledge	CMMLU	CMMLU Score: 60.5	6
Multi-turn Jailbreak Attack Robustness	Crescendomation	Harmful Rate: 15	4
Summarization	cnn	CNN Score: 13.6	4
