
Self-ReSET: Learning to Self-Recover from Unsafe Reasoning Trajectories

About

Large Reasoning Models (LRMs) possess remarkable capabilities for self-correction in general domains; however, they frequently struggle to recover from unsafe reasoning trajectories under adversarial attacks. Existing alignment methods attempt to mitigate this vulnerability by fine-tuning the model on expert data that include reflection traces or adversarial prefixes. Crucially, these approaches are often hindered by static training data, which inevitably deviate from the model's dynamic, on-policy reasoning traces, so the model barely covers its vast generation space and rarely learns to recover from its own failures. To bridge this gap, we propose Self-ReSET, a pure reinforcement learning framework designed to equip LRMs with the intrinsic capacity to recover from their own safety error trajectories, which are subsequently reused as an initial state for reinforcement learning. Extensive experiments across various LRMs and benchmarks demonstrate that Self-ReSET significantly enhances robustness against adversarial attacks, especially out-of-distribution (OOD) jailbreak prompts, while maintaining general utility and using data efficiently. Further analysis reveals that our method effectively fosters self-recovery patterns, enabling models to better identify unsafe intermediate error states and recover from them back to benign paths. Our code and data are available at https://github.com/Ing1024/Self-ReSET.
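The idea of reusing a model's own unsafe trajectories as initial states can be illustrated with a toy sketch. This is not the authors' implementation: the function names, the step-list representation of a trace, the binary reward, and the toy policy and judge are all hypothetical stand-ins chosen for illustration, and no policy update is performed.

```python
from typing import Callable, List, Tuple

Trace = List[str]  # a reasoning trace as a list of steps (assumed representation)


def collect_initial_states(
    prompts: List[str],
    rollout: Callable[[str, Trace], Trace],
    is_unsafe: Callable[[Trace], bool],
) -> List[Tuple[str, Trace]]:
    """Roll out the current policy and keep its own unsafe traces as start states."""
    states = []
    for prompt in prompts:
        trace = rollout(prompt, [])  # on-policy trace from an empty prefix
        if is_unsafe(trace):
            states.append((prompt, trace))  # the failure becomes an initial state
    return states


def recovery_reward(full_trace: Trace, is_unsafe: Callable[[Trace], bool]) -> float:
    """Hypothetical binary reward: 1.0 if the continued trace ends safely."""
    return 0.0 if is_unsafe(full_trace) else 1.0


# Toy stand-ins for a policy and a safety judge (purely illustrative).
def toy_rollout(prompt: str, prefix: Trace) -> Trace:
    if prefix:  # continuing from an unsafe prefix: the toy policy recovers
        return prefix + ["reconsider", "refuse"]
    if "attack" in prompt:
        return ["comply", "harmful step"]
    return ["answer"]


def toy_is_unsafe(trace: Trace) -> bool:
    return bool(trace) and trace[-1] == "harmful step"


states = collect_initial_states(
    ["benign question", "attack prompt"], toy_rollout, toy_is_unsafe
)
rewards = [
    recovery_reward(toy_rollout(prompt, prefix), toy_is_unsafe)
    for prompt, prefix in states
]
print(states)
print(rewards)
```

In an actual reinforcement learning loop, rewards like these would drive a policy update; the sketch stops at reward computation to avoid guessing details the abstract does not give.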

Dongcheng Zhang, Yi Zhang, Yuxin Chen, An Zhang, Xiang Wang, Chaochao Lu • 2026

Related benchmarks

Task | Dataset | Result | Rank
Jailbreaking Safety Evaluation | FORTRESS | Safety Score: 87.7 | 30
Mathematical Reasoning | AIME24 | AIME24 Avg@16: 62.3 | 26
Over-refusal evaluation | XSTest | Evaluation Score (avg@4): 98.4 | 26
Harmful Content Safety | HarmBench (HB) | Evaluation Score (avg@4): 98.4 | 18
Jailbreak Robustness | WildTeaming WJ | Evaluation Score (avg@4): 95.1 | 18
Jailbreak Robustness | safe-unlearning | Avg Evaluation Score (k=4): 98 | 18
Jailbreak Robustness | JB-R1 | Evaluation Score (avg@4): 98.1 | 18
Harmful Content Safety | StrongReject (SR) | Evaluation Score (avg@4): 100 | 18
Mathematical Reasoning | AIME 2024 | Pass@16: 86.7 | 12
Safety Recovery | H-CoT DeepSeek-R1 | Recovery Rate: 37 | 12
Showing 10 of 11 rows
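
Several results above are reported as avg@4 or Avg@16. The listing does not define these metrics; the sketch below assumes the common convention that avg@k is the mean score over k sampled responses per prompt, averaged over prompts and scaled to 0-100. The function name and the example judge outputs are hypothetical.

```python
from statistics import mean
from typing import List, Sequence


def avg_at_k(scores_per_prompt: List[Sequence[float]], k: int) -> float:
    """Average per-prompt mean score over the first k samples, on a 0-100 scale."""
    per_prompt = []
    for scores in scores_per_prompt:
        if len(scores) < k:
            raise ValueError(f"need at least {k} samples per prompt, got {len(scores)}")
        per_prompt.append(mean(scores[:k]))
    return 100.0 * mean(per_prompt)


# Hypothetical judge outputs: 1 = safe response, 0 = unsafe response.
example = [
    [1, 1, 1, 1],
    [1, 0, 1, 1],
    [1, 1, 1, 1],
]
print(round(avg_at_k(example, k=4), 1))  # prints 91.7
```

Under this assumed definition, a single unsafe sample among 12 responses lowers the score from 100 to about 91.7, which shows how sensitive the metric is on small prompt sets.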
