
ReasoningGuard: Safeguarding Large Reasoning Models with Inference-time Safety Aha Moments

About

Large Reasoning Models (LRMs) have demonstrated impressive performance on reasoning-intensive tasks, but they remain vulnerable to harmful content generation, particularly in the mid-to-late steps of their reasoning processes. Existing defense mechanisms, however, rely on costly fine-tuning and additional expert knowledge, which restricts their scalability. In this work, we propose ReasoningGuard, an inference-time safeguard for LRMs that injects timely safety aha moments to steer reasoning toward outputs that are both harmless and helpful. Leveraging the model's internal attention behavior, our approach accurately identifies critical points in the reasoning path and triggers spontaneous, safety-oriented reflection. To safeguard both the subsequent reasoning steps and the final answers, we further implement a scaling sampling strategy during the decoding phase, selecting the optimal reasoning path. Incurring minimal extra inference cost, ReasoningGuard effectively mitigates three types of jailbreak attacks, including the latest ones targeting the reasoning process of LRMs. Our approach outperforms seven existing safeguards, achieving state-of-the-art safety defenses while avoiding the common problem of exaggerated safety.

Yuquan Wang, Mi Zhang, Yining Wang, Geng Hong, Xiaoyu You, Min Yang • 2025
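The abstract describes a three-part inference-time pipeline: detect a critical reasoning step from the model's attention behavior, inject a safety "aha moment" phrase at that point, and pick the best of several sampled reasoning paths. The minimal sketch below illustrates that control flow only; the function names, the attention threshold, the `SAFETY_PHRASE` wording, and the scoring function are all illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch of a ReasoningGuard-style inference-time safeguard.
# Everything here (names, threshold, phrase) is an assumption for exposition,
# not the paper's actual method or code.

SAFETY_PHRASE = "Wait, let me check whether this reasoning step could cause harm."


def is_critical_step(attn_to_prompt: float, threshold: float = 0.15) -> bool:
    """Stand-in for the paper's attention-based signal: flag a reasoning step
    whose attention back to the (possibly adversarial) prompt is high."""
    return attn_to_prompt > threshold


def inject_aha(steps: list[str], attn_scores: list[float],
               threshold: float = 0.15) -> list[str]:
    """Insert a safety reflection right after the first flagged step,
    letting the model continue reasoning from that reflection."""
    out, injected = [], False
    for step, attn in zip(steps, attn_scores):
        out.append(step)
        if not injected and is_critical_step(attn, threshold):
            out.append(SAFETY_PHRASE)
            injected = True
    return out


def best_of_n(paths: list[list[str]], score) -> list[str]:
    """Stand-in for the scaling sampling strategy: sample N reasoning paths
    and keep the one with the highest combined safety/helpfulness score."""
    return max(paths, key=score)
```

A hypothetical usage: generate N continuations after the injected phrase, score each with a safety-aware reward, and decode the final answer only from the winning path, so both the remaining reasoning steps and the answer inherit the safeguard.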

Related benchmarks

Task                           Dataset          Result               Rank
Mathematical Reasoning         MATH             Accuracy 94.2        535
Mathematical Reasoning         AIME             AIME Accuracy 76.7   283
Science Reasoning              GPQA             Accuracy 60.6        218
Harmful Request Defense        AdvBench         ASR 0.00             44
Jailbreak Defense              Wild Jailbreak   ASR 4.9              36
Red-teaming Safety Evaluation  HarmBench        ASR 1.5              32
Jailbreak Attack Defense       PAIR             ASR 1                24
Jailbreak Attack Defense       FORTRESS         ASR 9.8              24
Over-refusal Assessment        XS (test)        F1 Score 93.1        24
General Reasoning              MMLU-P           Accuracy 75.6        24

Showing 10 of 12 rows.
