
THINKSAFE: Self-Generated Safety Alignment for Reasoning Models

About

Large reasoning models (LRMs) achieve remarkable performance by leveraging reinforcement learning (RL) on reasoning tasks to generate long chain-of-thought (CoT) reasoning. However, this over-optimization often prioritizes compliance, making models vulnerable to harmful prompts. To mitigate this safety degradation, recent approaches rely on external teacher distillation, yet this introduces a distributional discrepancy that degrades native reasoning. We propose ThinkSafe, a self-generated alignment framework that restores safety alignment without external teachers. Our key insight is that while compliance suppresses safety mechanisms, models often retain latent knowledge to identify harm. ThinkSafe unlocks this via lightweight refusal steering, guiding the model to generate in-distribution safety reasoning traces. Fine-tuning on these self-generated responses effectively realigns the model while minimizing distribution shift. Experiments on DeepSeek-R1-Distill and Qwen3 show ThinkSafe significantly improves safety while preserving reasoning proficiency. Notably, it achieves superior safety and comparable reasoning to GRPO, with significantly reduced computational cost. Code, models, and datasets are available at https://github.com/seanie12/ThinkSafe.git.
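The refusal-steering step described above can be pictured as standard activation steering: estimate a "refusal direction" as the difference in mean hidden activations between prompts the model refuses and prompts it complies with, then add a scaled copy of that direction to hidden states during generation so the model surfaces its latent safety reasoning. A minimal NumPy sketch, with all function names, the difference-in-means extraction, and the scale `alpha` being illustrative assumptions rather than the paper's exact procedure:

```python
import numpy as np

def refusal_direction(h_refuse: np.ndarray, h_comply: np.ndarray) -> np.ndarray:
    """Hypothetical difference-in-means steering vector.

    h_refuse: (n_refuse, d) hidden states from prompts the model refuses.
    h_comply: (n_comply, d) hidden states from prompts the model answers.
    Returns a unit-norm direction pointing toward refusal behavior.
    """
    d = h_refuse.mean(axis=0) - h_comply.mean(axis=0)
    return d / np.linalg.norm(d)

def steer(hidden: np.ndarray, direction: np.ndarray, alpha: float = 4.0) -> np.ndarray:
    """Add the scaled refusal direction to every token's hidden state.

    hidden: (seq_len, d) activations at the steered layer.
    alpha:  steering strength (a tunable assumption, not a published value).
    """
    return hidden + alpha * direction
```

In this sketch, steering is only applied while sampling the safety reasoning traces; the fine-tuning stage then trains on the resulting responses with the steering removed, which is what keeps the traces close to the model's own distribution.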

Seanie Lee, Sangwoo Park, Yumin Choi, Gyeongman Kim, Minki Kang, Jihun Yun, Dongmin Park, Jongho Park, Sung Ju Hwang • 2026

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Mathematical Reasoning | MATH 500 | Pass@1 | 91.9 | 153 |
| Mathematical Reasoning | AIME 2024 | Pass@1 | 51.25 | 54 |
| Safety Evaluation | StrongREJECT | Attack Success Rate | 26.52 | 45 |
| Over-refusal | XSTest | -- | -- | 42 |
| Reasoning | Reasoning Evaluation Suite (AIME 2024, GSM8k, MATH 500, GPQA) | AIME 2024 Score | 0.7333 | 32 |
| Safety Evaluation | Safety Evaluation Suite (HarmBench, StrongReject, WildJailbreak, XSTest) | HarmBench Score | 40.37 | 28 |
| Harmfulness Evaluation | HarmBench | Harmful Response Ratio | 27.08 | 21 |
| Safety | WildJailbreak | Harmful Response Ratio | 21.15 | 21 |
| Reasoning | GSM8K | Pass@1 | 90.1 | 21 |
| Reasoning | GPQA | Pass@1 | 46.28 | 21 |

Other info

GitHub: https://github.com/seanie12/ThinkSafe.git
