
Safety Tax: Safety Alignment Makes Your Large Reasoning Models Less Reasonable

About

Safety alignment is an important procedure before the official deployment of a Large Language Model (LLM). While safety alignment has been extensively studied for LLMs, there is still a large research gap for Large Reasoning Models (LRMs), which are equipped with improved reasoning capabilities. In this paper, we systematically examine a simplified pipeline for producing safety-aligned LRMs. Evaluating a range of LRMs, we deliver two main findings: i) safety alignment can be applied on top of an LRM to restore its safety capability; ii) safety alignment degrades the reasoning capability of the LRM. Together, these findings show that there is a trade-off between reasoning and safety capability under the sequential LRM production pipeline. This trade-off, which we name the Safety Tax, should shed light on future safety research on LRMs. As a by-product, we curate a dataset called DirectRefusal, which may serve as an alternative dataset for safety alignment. Our source code is available at https://github.com/git-disl/Safety-Tax.
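To make the sequential pipeline concrete, the sketch below shows its safety-alignment stage: supervised fine-tuning (SFT) of an already-trained reasoning model on refusal data. This is an illustration under stated assumptions, not the authors' implementation; the HuggingFace TRL trainer, the base-model ID, the `direct_refusal.jsonl` file name, and the hyperparameters are all placeholders standing in for the paper's DirectRefusal dataset and setup (see the repository for the actual code).

```python
# Sketch of the safety-alignment stage of a sequential LRM pipeline:
# SFT of an existing reasoning model on (harmful prompt -> refusal) pairs.
# Hypothetical setup: model ID, file name, and hyperparameters are assumptions.
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

# Assumed DirectRefusal-style data: one {"messages": [...]} record per line,
# pairing a harmful prompt with a direct refusal as the target response.
refusal_data = load_dataset("json", data_files="direct_refusal.jsonl", split="train")

trainer = SFTTrainer(
    model="deepseek-ai/DeepSeek-R1-Distill-Qwen-7B",  # placeholder reasoning model
    train_dataset=refusal_data,
    args=SFTConfig(
        output_dir="safety-aligned-lrm",
        num_train_epochs=1,              # small budget: alignment, not re-training
        per_device_train_batch_size=4,
        learning_rate=1e-5,
    ),
)
trainer.train()  # the aligned checkpoint is what the safety benchmarks score
```

Because the alignment step runs after reasoning training rather than jointly with it, any shift it induces in the model's output distribution is unconstrained by the reasoning objective, which is where the measured reasoning degradation can enter.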

Tiansheng Huang, Sihao Hu, Fatih Ilhan, Selim Furkan Tekin, Zachary Yahn, Yichang Xu, Ling Liu • 2025

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Mathematical Reasoning | MATH 500 | Pass@1 | 89.82 | 153 |
| Mathematical Reasoning | AIME 2024 | Pass@1 | 47.5 | 54 |
| Safety Evaluation | StrongREJECT | Attack Success Rate | 0.64 | 45 |
| Over-refusal | XSTest | -- | -- | 42 |
| Reasoning | Reasoning Evaluation Suite (AIME 2024, GSM8K, MATH 500, GPQA) | AIME 2024 Score | 0.7458 | 32 |
| Safety Evaluation | Safety Evaluation Suite (HarmBench, StrongREJECT, WildJailbreak, XSTest) | HarmBench Score | 43.85 | 28 |
| Harmfulness Evaluation | HarmBench | Harmful Response Ratio | 32.39 | 21 |
| Safety | WildJailbreak | Harmful Response Ratio | 32.6 | 21 |
| Reasoning | GSM8K | Pass@1 | 88.27 | 21 |
| Reasoning | GPQA | Pass@1 | 44.95 | 21 |
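For reference, the reasoning and safety numbers above are simple per-example ratios over a benchmark. The following is a hedged illustration of how such metrics are typically computed; the function names and the data are hypothetical, not taken from the paper or its evaluation harness.

```python
# Illustrative computation of the two metric families in the table:
# Pass@1 for reasoning benchmarks, and harmful-response ratio for safety
# benchmarks. All inputs below are hypothetical examples.

def pass_at_1(correct: list[bool]) -> float:
    """Percentage of problems whose single sampled answer is correct."""
    return 100.0 * sum(correct) / len(correct)

def harmful_response_ratio(judged_harmful: list[bool]) -> float:
    """Percentage of harmful prompts that elicit a harmful (non-refusing)
    reply, as judged by a safety classifier or human annotator."""
    return 100.0 * sum(judged_harmful) / len(judged_harmful)

# Hypothetical data: 3 of 4 math answers correct -> Pass@1 = 75.0
print(pass_at_1([True, True, True, False]))
# 1 of 5 harmful prompts answered harmfully -> ratio = 20.0
print(harmful_response_ratio([False, True, False, False, False]))
```

The Safety Tax trade-off shows up as movement in opposite directions across these two families: safety alignment lowers the harmful-response ratio while also lowering Pass@1 on the reasoning benchmarks.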
