Safety Tax: Safety Alignment Makes Your Large Reasoning Models Less Reasonable
About
Safety alignment is an important procedure before the official deployment of a Large Language Model (LLM). While safety alignment has been extensively studied for LLMs, there is still a large research gap for Large Reasoning Models (LRMs), which are equipped with improved reasoning capability. In this paper, we systematically examine a simplified pipeline for producing safety-aligned LRMs. Evaluating various LRMs, we deliver two main findings: i) safety alignment can be performed on an LRM to restore its safety capability; ii) safety alignment degrades the reasoning capability of the LRM. Together, these findings show that there is a trade-off between reasoning and safety capability in the sequential LRM production pipeline. This trade-off, which we name the Safety Tax, should shed light on future endeavors in safety research on LRMs. As a by-product, we curate a dataset called DirectRefusal, which may serve as an alternative dataset for safety alignment. Our source code is available at https://github.com/git-disl/Safety-Tax.
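The sequential pipeline described above, reasoning training first and safety alignment second, reduces the alignment step to a supervised fine-tuning (SFT) pass over a refusal dataset. The sketch below illustrates that step with Hugging Face's `trl` library; the model identifier, local dataset file, expected `text` field, and hyperparameters are all illustrative assumptions, not the paper's exact configuration.

```python
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

# Hypothetical local file; each record is assumed to hold a "text" field that
# pairs a harmful prompt with a direct refusal, in the spirit of DirectRefusal.
dataset = load_dataset("json", data_files="direct_refusal.jsonl", split="train")

trainer = SFTTrainer(
    model="deepseek-ai/DeepSeek-R1-Distill-Qwen-7B",  # example LRM, not the paper's exact choice
    train_dataset=dataset,
    args=SFTConfig(output_dir="lrm-safety-aligned", num_train_epochs=1),
)
trainer.train()
# After this step, re-evaluate both safety (e.g., StrongREJECT) and reasoning
# (e.g., MATH 500) benchmarks to measure the Safety Tax.
```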
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Mathematical Reasoning | MATH 500 | Pass@1: 89.82 | 153 |
| Mathematical Reasoning | AIME 2024 | Pass@1: 47.5 | 54 |
| Safety Evaluation | StrongREJECT | Attack Success Rate: 0.64 | 45 |
| Over-refusal | XSTest | -- | 42 |
| Reasoning | Reasoning Evaluation Suite (AIME 2024, GSM8K, MATH 500, GPQA) | AIME 2024 Score: 0.7458 | 32 |
| Safety Evaluation | Safety Evaluation Suite (HarmBench, StrongREJECT, WildJailbreak, XSTest) | HarmBench Score: 43.85 | 28 |
| Harmfulness Evaluation | HarmBench | Harmful Response Ratio: 32.39 | 21 |
| Safety | WildJailbreak | Harmful Response Ratio: 32.6 | 21 |
| Reasoning | GSM8K | Pass@1: 88.27 | 21 |
| Reasoning | GPQA | Pass@1: 44.95 | 21 |
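The Pass@1 figures in the table follow the standard pass@k convention from the code-generation literature. Assuming the benchmarks use the common unbiased estimator (Chen et al., 2021) rather than a custom evaluation script, a minimal sketch of the computation looks like this; the function name and example values are illustrative:

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: the probability that at least one of k
    samples drawn from n generations (c of them correct) solves the problem.
    With n == 1, pass@1 reduces to plain single-sample accuracy."""
    if n - c < k:
        return 1.0  # every size-k draw must contain a correct sample
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

# Example: 4 generations per problem, 3 correct -> pass@1 estimate of 0.75.
print(pass_at_k(n=4, c=3, k=1))
```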