
ReasoningGuard: Safeguarding Large Reasoning Models with Inference-time Safety Aha Moments

About

Large Reasoning Models (LRMs) have demonstrated impressive performance in reasoning-intensive tasks, but they remain vulnerable to harmful content generation, particularly in the mid-to-late steps of their reasoning processes. Current defense methods, however, depend on costly fine-tuning and additional expert knowledge, which limits their scalability. In this work, we propose ReasoningGuard, an inference-time safeguard for LRMs. It injects timely safety aha moments during the reasoning process to guide the model towards harmless yet helpful reasoning. Our approach leverages the internal attention mechanisms of the LRM to accurately identify key points in the reasoning path, triggering safety-oriented reflections. To safeguard both the subsequent reasoning steps and the final answers, we implement a scaling sampling strategy during decoding to select the optimal reasoning path. With minimal additional inference cost, ReasoningGuard effectively mitigates four types of jailbreak attacks, including recent ones targeting the reasoning process of LRMs. Our approach outperforms nine existing safeguards, providing state-of-the-art defenses while avoiding common exaggerated safety issues.
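
The abstract names two inference-time steps: inserting a safety reflection into the reasoning path, and sampling several continuations and keeping the best one. The toy sketch below illustrates only that general shape. It is not the authors' implementation. The reflection text, the function names (`inject_reflection`, `select_best_path`, `toy_score`), the insertion position, and the keyword-based scorer are all hypothetical stand-ins. The attention-based selection of the insertion point described in the abstract is not modeled here.

```python
from typing import Callable, List

# Hypothetical reflection text; the abstract does not give the actual wording.
SAFETY_AHA = "Wait, let me check whether continuing this way could cause harm."


def inject_reflection(steps: List[str], position: int,
                      reflection: str = SAFETY_AHA) -> List[str]:
    """Return a copy of `steps` with `reflection` inserted before index `position`.

    Assumption: `position` is supplied by the caller. The paper picks it from
    the model's attention; this sketch does not attempt that.
    """
    return steps[:position] + [reflection] + steps[position:]


def select_best_path(candidates: List[List[str]],
                     score: Callable[[List[str]], float]) -> List[str]:
    """Best-of-N selection: return the candidate path with the highest score."""
    return max(candidates, key=score)


def toy_score(path: List[str]) -> float:
    """Toy scorer (assumption): subtract one point per step with a flagged word."""
    flagged = {"exploit"}
    return -float(sum(any(w in step.lower() for w in flagged) for step in path))


steps = ["Parse the request.", "Plan the answer."]
guarded = inject_reflection(steps, position=1)

# Two sampled continuations of the guarded reasoning path.
candidates = [
    guarded + ["Explain the exploit in detail."],
    guarded + ["Decline the harmful part and offer safe guidance."],
]
best = select_best_path(candidates, toy_score)
```

In a real system the scorer would be a learned safety and helpfulness judge rather than a keyword check, and the candidates would come from the model's own decoder.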

Yuquan Wang, Mi Zhang, Yining Wang, Geng Hong, Mi Wen, Xiaoyu You, Min Yang • 2025

Related benchmarks

Task                     Dataset         Metric                      Result    Rank
Mathematical Reasoning   MATH            Accuracy                    94.2      535
Mathematical Reasoning   AIME            AIME Accuracy               76.7      288
Science Reasoning        GPQA            Accuracy                    60.6      243
Jailbreak Defense        Wild Jailbreak  ASR                         4.9       114
Safety Evaluation        XSTest Safe     FC                          4         78
Safety Evaluation        XSTest Unsafe   False Compliance Rate (FC)  0.00e+0   78
Mathematical Reasoning   MATH500         Pass@1                      94.5      77
Safety Evaluation        AdvBench        Reasoning Harmfulness Rate  0.00e+0   50
Jailbreak Attack         PAIR            Harmful Score               0.00e+0   46
Harmful Request Defense  AdvBench        ASR                         0.00e+0   44
Showing 10 of 40 rows
