Self-ReSET: Learning to Self-Recover from Unsafe Reasoning Trajectories

About

Large Reasoning Models (LRMs) show remarkable capabilities for self-correction in general domains; however, they frequently struggle to recover from unsafe reasoning trajectories under adversarial attacks. Existing alignment methods attempt to mitigate this vulnerability by fine-tuning the model on expert data that includes reflection traces or adversarial prefixes. These approaches are often hindered by static training data, which inevitably deviates from the model's dynamic, on-policy reasoning traces; as a result, the model covers little of its vast generation space and hardly learns to recover from its own failures. To bridge this gap, we propose Self-ReSET, a pure reinforcement learning framework designed to equip LRMs with the intrinsic capacity to recover from their own safety error trajectories, which are subsequently reused as an initial state for reinforcement learning. Extensive experiments across various LRMs and benchmarks demonstrate that Self-ReSET significantly enhances robustness against adversarial attacks, especially out-of-distribution (OOD) jailbreak prompts, while maintaining general utility and using data efficiently. Further analysis reveals that our method effectively fosters self-recovery patterns, enabling models to better identify unsafe intermediate error states and recover from them back to benign paths. Our code and data are available at https://github.com/Ing1024/Self-ReSET.
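The idea of reusing a model's own unsafe trajectories as starting points for reinforcement learning can be illustrated with a minimal sketch. This is not the authors' implementation: the names `RecoveryState`, `build_recovery_states`, `sample_reasoning`, and `is_unsafe`, as well as the random truncation fractions, are assumptions made purely for illustration. The abstract only states that the model's own safety error trajectories are reused as an initial state.

```python
import random
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class RecoveryState:
    """Hypothetical container: a prompt plus an unsafe on-policy reasoning prefix."""
    prompt: str
    unsafe_prefix: str


def build_recovery_states(
    prompts: List[str],
    sample_reasoning: Callable[[str], str],  # assumed: one rollout of the current policy
    is_unsafe: Callable[[str, str], bool],   # assumed: a safety judge over (prompt, trace)
    rng: random.Random,
    min_frac: float = 0.2,                   # assumed truncation range, not from the paper
    max_frac: float = 0.8,
) -> List[RecoveryState]:
    """Collect truncated unsafe traces that later RL rollouts could start from."""
    states = []
    for prompt in prompts:
        trace = sample_reasoning(prompt)
        if not is_unsafe(prompt, trace):
            continue  # only the model's own failures become initial states
        tokens = trace.split()
        # Keep a random leading portion so the policy must continue from mid-error.
        cut = max(1, int(len(tokens) * rng.uniform(min_frac, max_frac)))
        states.append(RecoveryState(prompt, " ".join(tokens[:cut])))
    return states
```

In a training loop, each `RecoveryState` would be fed back to the policy as prompt plus prefix, and the continuation would be rewarded for returning to a safe answer. How the reward is defined is left open here, since the abstract does not specify it.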

Dongcheng Zhang, Yi Zhang, Yuxin Chen, An Zhang, Xiang Wang, Chaochao Lu • 2026

Related benchmarks

Task	Dataset	Metric	Result
Over-refusal evaluation	XSTest	Evaluation Score (avg@4)	98.4	70
Jailbreaking Safety Evaluation	FORTRESS	Safety Score	87.7	34
Mathematical Reasoning	AIME24	AIME24 Avg@16	62.3	26
Harmful Content Safety	HarmBench (HB)	Evaluation Score (avg@4)	98.4	18
Jailbreak Robustness	WildTeaming WJ	Evaluation Score (avg@4)	95.1	18
Jailbreak Robustness	safe-unlearning	Avg Evaluation Score (k=4)	98	18
Jailbreak Robustness	JB-R1	Evaluation Score (avg@4)	98.1	18
Harmful Content Safety	StrongReject (SR)	Evaluation Score (avg@4)	100	18
Mathematical Reasoning	AIME 2024	Pass@16	86.7	12
Safety Recovery	H-CoT DeepSeek-R1	Recovery Rate	37	12
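
The page does not define the "avg@4" metrics. Under the common reading (each prompt is sampled k times, the k per-sample scores are averaged, and the result is averaged over prompts), the computation could be sketched as follows. The function name `avg_at_k` and the 0-to-1 per-sample score scale are assumptions for illustration.

```python
from statistics import mean
from typing import Sequence


def avg_at_k(scores_per_prompt: Sequence[Sequence[float]], k: int) -> float:
    """Mean over prompts of the mean score of the first k samples, as a percentage."""
    per_prompt = []
    for scores in scores_per_prompt:
        if len(scores) < k:
            raise ValueError(f"need at least {k} samples per prompt, got {len(scores)}")
        per_prompt.append(mean(scores[:k]))
    return 100.0 * mean(per_prompt)


# Two prompts, four samples each; 1 = judged safe, 0 = judged unsafe.
print(avg_at_k([[1, 1, 1, 0], [1, 1, 1, 1]], k=4))  # 87.5
```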

