RealSafe-R1: Safety-Aligned DeepSeek-R1 without Compromising Reasoning Capability

About

Large Reasoning Models (LRMs), such as OpenAI o1 and DeepSeek-R1, have progressed rapidly and achieved breakthrough performance on complex reasoning tasks such as mathematics and coding. However, the open-source R1 models have raised safety concerns in wide applications, such as a tendency to comply with malicious queries, which greatly limits the utility of these powerful models in practice. In this paper, we introduce RealSafe-R1, safety-aligned versions of the DeepSeek-R1 distilled models. To train these models, we construct a dataset of 15k safety-aware reasoning trajectories generated by DeepSeek-R1 under explicit instructions for the expected refusal behavior. Both quantitative experiments and qualitative case studies demonstrate that the models have stronger safety guardrails against both harmful queries and jailbreak attacks. Importantly, unlike prior safety alignment efforts, which often compromise reasoning performance, our method preserves the models' reasoning capabilities by keeping the training data within the original generation distribution. Model weights of RealSafe-R1 are openly available at https://huggingface.co/RealSafe.

Yichi Zhang, Zihao Zeng, Dongbai Li, Yao Huang, Zhijie Deng, Yinpeng Dong • 2025

Related benchmarks

Task	Dataset	Result
Code Generation	HumanEval	Pass@1 79.2	171
Over-refusal	XSTest	Overrefusal Rate 16	102
Multitask Knowledge	MMLU	Accuracy 72.7	92
Safety Evaluation	XSTest Unsafe	False Refusal Rate (FR) 38	84
Safety Evaluation	XSTest Safe	FC 32	78
Mathematical Reasoning	Minerva	Pass@1 35.8	78
Mathematical Reasoning	MATH500	Pass@1 79.1	77
Mathematical Reasoning	MATH 500	Pass@1 89.8	68
Safety Evaluation	AdvBench	Reasoning Harmfulness Rate 0.00	50
Scientific Reasoning	GPQA Diamond	--	48

Showing 10 of 55 rows
