Self-ReSET: Learning to Self-Recover from Unsafe Reasoning Trajectories
About
Large Reasoning Models (LRMs) possess remarkable capabilities for self-correction in general domains; however, they frequently struggle to recover from unsafe reasoning trajectories under adversarial attacks. Existing alignment methods attempt to mitigate this vulnerability by fine-tuning the model on expert data that include reflection traces or adversarial prefixes. Crucially, these approaches are often hindered by static training data that inevitably deviate from the model's dynamic, on-policy reasoning traces, so the model covers little of its vast generation space and does not learn to recover from its own failures. To bridge this gap, we propose Self-ReSET, a pure reinforcement learning framework designed to equip LRMs with the intrinsic capacity to recover from their own safety error trajectories, which are subsequently reused as initial states for reinforcement learning. Extensive experiments across various LRMs and benchmarks demonstrate that Self-ReSET significantly enhances robustness against adversarial attacks, especially out-of-distribution (OOD) jailbreak prompts, while maintaining general utility and using data efficiently. Further analysis reveals that our method effectively fosters self-recovery patterns, enabling models to better identify unsafe intermediate error states and recover from them back to benign paths. Our code and data are available at https://github.com/Ing1024/Self-ReSET.
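The idea of reusing a model's own unsafe trajectories as starting points can be illustrated with a toy sketch. This is not the authors' implementation: `collect_unsafe_prefixes`, `rollout`, and `is_unsafe_step` are hypothetical names, and cutting each trajectory at the first flagged step is an assumption made only for illustration.

```python
def collect_unsafe_prefixes(prompts, rollout, is_unsafe_step, samples_per_prompt=2):
    """Return (prompt, prefix) pairs, each prefix ending at the first unsafe step.

    rollout(prompt) -> list of reasoning steps sampled from the current policy
    is_unsafe_step(step) -> bool, a stand-in for a safety judge (hypothetical)
    """
    initial_states = []
    for prompt in prompts:
        for _ in range(samples_per_prompt):
            steps = rollout(prompt)
            for i, step in enumerate(steps):
                if is_unsafe_step(step):
                    # Assumption: keep everything up to and including the
                    # first unsafe step as a later RL starting state.
                    initial_states.append((prompt, steps[: i + 1]))
                    break
    return initial_states
```

In such a setup, later reinforcement learning rollouts would continue from these prefixes, so the policy is rewarded for steering its own failed trajectories back to a safe answer rather than imitating static expert traces.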
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Jailbreaking Safety Evaluation | FORTRESS | Safety Score | 87.7 | 30 |
| Mathematical Reasoning | AIME24 | AIME24 Avg@16 | 62.3 | 26 |
| Over-refusal evaluation | XSTest | Evaluation Score (avg@4) | 98.4 | 26 |
| Harmful Content Safety | HarmBench (HB) | Evaluation Score (avg@4) | 98.4 | 18 |
| Jailbreak Robustness | WildTeaming WJ | Evaluation Score (avg@4) | 95.1 | 18 |
| Jailbreak Robustness | safe-unlearning | Avg Evaluation Score (k=4) | 98 | 18 |
| Jailbreak Robustness | JB-R1 | Evaluation Score (avg@4) | 98.1 | 18 |
| Harmful Content Safety | StrongReject (SR) | Evaluation Score (avg@4) | 100 | 18 |
| Mathematical Reasoning | AIME 2024 | Pass@16 | 86.7 | 12 |
| Safety Recovery | H-CoT DeepSeek-R1 | Recovery Rate | 37 | 12 |
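
The table mixes two kinds of sampled metrics, avg@k and Pass@k, which can differ widely on the same model outputs. The sketch below shows one common reading of each, assuming k graded samples per prompt with scores in [0, 1]; the function names are hypothetical, and the exact definitions used by each benchmark are not stated in the source.

```python
def avg_at_k(scores_per_prompt):
    """Mean score over the k samples of each prompt, averaged over prompts (in %)."""
    per_prompt = [sum(s) / len(s) for s in scores_per_prompt]
    return 100.0 * sum(per_prompt) / len(per_prompt)


def pass_at_k(scores_per_prompt):
    """Share of prompts where at least one of the k samples scores above 0 (in %)."""
    solved = [1.0 if any(x > 0 for x in s) else 0.0 for s in scores_per_prompt]
    return 100.0 * sum(solved) / len(solved)


# Two prompts, k = 4 samples each: one prompt solved once, one never solved.
scores = [[1, 0, 0, 0], [0, 0, 0, 0]]
print(avg_at_k(scores))   # 12.5
print(pass_at_k(scores))  # 50.0
```

Under these assumptions, Pass@k can never be lower than avg@k on the same samples, which is consistent with the two AIME rows reporting a higher Pass@16 than Avg@16.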