STAIR: Improving Safety Alignment with Introspective Reasoning

About

Ensuring the safety and harmlessness of Large Language Models (LLMs) has become as critical as their performance in applications. However, existing safety alignment methods typically suffer from safety-performance trade-offs and susceptibility to jailbreak attacks, primarily due to their reliance on direct refusals for malicious queries. In this paper, we propose STAIR, a novel framework that integrates SafeTy Alignment with Introspective Reasoning. We enable LLMs to identify safety risks through step-by-step analysis by self-improving chain-of-thought (CoT) reasoning with safety awareness. STAIR first equips the model with a structured reasoning capability and then advances safety alignment via iterative preference optimization on step-level reasoning data generated using our newly proposed Safety-Informed Monte Carlo Tree Search (SI-MCTS). We further train a process reward model on this data to guide test-time searches for improved responses. Extensive experiments show that STAIR effectively mitigates harmful outputs while better preserving helpfulness, compared to instinctive alignment strategies. With test-time scaling, STAIR achieves safety performance comparable to Claude-3.5 against popular jailbreak attacks. Relevant resources in this work are available at https://github.com/thu-ml/STAIR.

Yichi Zhang, Siyuan Zhang, Yao Huang, Zeyu Xia, Zhengwei Fang, Xiao Yang, Ranjie Duan, Dong Yan, Yinpeng Dong, Jun Zhu • 2025
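
The abstract mentions using a process reward model to guide test-time search. One simple form such a search could take is Best-of-N selection over candidate reasoning chains; the sketch below illustrates that idea only and is not the authors' implementation. The names `score_response`, `best_of_n`, and `toy_step_reward` are hypothetical, the averaging of step rewards is an assumed aggregation, and the keyword-based reward is a toy stand-in for a trained model.

```python
from typing import Callable, List, Sequence

# A step-level reward function maps a reasoning prefix (steps so far) to a score.
StepReward = Callable[[Sequence[str]], float]


def score_response(steps: Sequence[str], step_reward: StepReward) -> float:
    """Average the reward of every reasoning prefix (assumed aggregation)."""
    if not steps:
        return float("-inf")
    total = sum(step_reward(steps[: i + 1]) for i in range(len(steps)))
    return total / len(steps)


def best_of_n(candidates: List[List[str]], step_reward: StepReward) -> List[str]:
    """Return the candidate reasoning chain with the highest aggregated score."""
    if not candidates:
        raise ValueError("need at least one candidate")
    return max(candidates, key=lambda steps: score_response(steps, step_reward))


def toy_step_reward(prefix: Sequence[str]) -> float:
    """Toy stand-in for a trained reward model: rewards steps mentioning risk or safety."""
    last = prefix[-1].lower()
    return 1.0 if ("risk" in last or "safe" in last) else 0.0


candidates = [
    ["Sure, here is how."],
    ["Identify the risk in the request.", "Give a safe alternative."],
]
best = best_of_n(candidates, toy_step_reward)
```

In practice the candidates would be sampled from the aligned model and `toy_step_reward` would be replaced by the trained reward model; both are outside the scope of this sketch.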

Related benchmarks

Task	Dataset	Result
Scientific Question Answering	GPQA Diamond	Accuracy 48.98	131
Massive Multitask Language Understanding	MMLU-Pro	Accuracy (MMLU-Pro) 44.92	122
Over-refusal	XSTest	--	102
Mathematical Reasoning	GSM8K	Accuracy 87.6	80
Safety Evaluation	StrongREJECT	--	77
Mathematical Reasoning	MATH500	Accuracy (%) 83.8	56
Safety Evaluation	HarmBench	PAIR 78.75	39
Safety Evaluation	WildChat	Safe@1 77.8	34
Specification Alignment	SPECBENCH Average over scenarios	Safety Score 89.27	33
Safety Evaluation	AdvBench	Overall Safety Score 100	30

Showing 10 of 54 rows
