Safety Tax: Safety Alignment Makes Your Large Reasoning Models Less Reasonable
About
Safety alignment is an important procedure before the official deployment of a Large Language Model (LLM). While safety alignment has been extensively studied for LLMs, there is still a large research gap for Large Reasoning Models (LRMs), which are equipped with improved reasoning capability. In this paper, we systematically examine a simplified pipeline for producing safety-aligned LRMs. Evaluating various LRMs, we deliver two main findings: i) safety alignment can be performed on an LRM to restore its safety capability; ii) safety alignment degrades the reasoning capability of the LRM. Together, these findings show that there is a trade-off between reasoning and safety capability in the sequential LRM production pipeline. This trade-off, which we name the Safety Tax, should shed light on future endeavors in safety research on LRMs. As a by-product, we curate a dataset called DirectRefusal, which may serve as an alternative dataset for safety alignment. Our source code is available at https://github.com/git-disl/Safety-Tax.
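The sequential pipeline described above, reasoning training first and safety alignment second, reduces the alignment step to a supervised fine-tuning (SFT) pass over a refusal dataset. The sketch below illustrates that step with Hugging Face's `trl` library; the model identifier, local dataset file, expected `text` field, and hyperparameters are all illustrative assumptions, not the paper's exact configuration.

```python
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

# Hypothetical local file; each record is assumed to hold a "text" field that
# pairs a harmful prompt with a direct refusal, in the spirit of DirectRefusal.
dataset = load_dataset("json", data_files="direct_refusal.jsonl", split="train")

trainer = SFTTrainer(
    model="deepseek-ai/DeepSeek-R1-Distill-Qwen-7B",  # example LRM, not the paper's exact choice
    train_dataset=dataset,
    args=SFTConfig(output_dir="lrm-safety-aligned", num_train_epochs=1),
)
trainer.train()
# After this step, re-evaluate both safety (e.g., StrongREJECT) and reasoning
# (e.g., MATH 500) benchmarks to measure the Safety Tax.
```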
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Mathematical Reasoning | MATH 500 | Pass@1: 89.82 | 153 |
| Mathematical Reasoning | AIME 2024 | Pass@1: 47.5 | 54 |
| Safety Evaluation | StrongREJECT | Attack Success Rate: 0.64 | 45 |
| Over-refusal | XSTest | -- | 42 |
| Reasoning | Reasoning Evaluation Suite (AIME 2024, GSM8K, MATH 500, GPQA) | AIME 2024 Score: 0.7458 | 32 |
| Safety Evaluation | Safety Evaluation Suite (HarmBench, StrongREJECT, WildJailbreak, XSTest) | HarmBench Score: 43.85 | 28 |
| Harmfulness Evaluation | HarmBench | Harmful Response Ratio: 32.39 | 21 |
| Safety | WildJailbreak | Harmful Response Ratio: 32.6 | 21 |
| Reasoning | GSM8K | Pass@1: 88.27 | 21 |
| Reasoning | GPQA | Pass@1: 44.95 | 21 |
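The Pass@1 figures in the table follow the standard pass@k convention from the code-generation literature. Assuming the benchmarks use the common unbiased estimator (Chen et al., 2021) rather than a custom evaluation script, a minimal sketch of the computation looks like this; the function name and example values are illustrative:

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: the probability that at least one of k
    samples drawn from n generations (c of them correct) solves the problem.
    With n == 1, pass@1 reduces to plain single-sample accuracy."""
    if n - c < k:
        return 1.0  # every size-k draw must contain a correct sample
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

# Example: 4 generations per problem, 3 correct -> pass@1 estimate of 0.75.
print(pass_at_k(n=4, c=3, k=1))
```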