ConsisGuard: Aligning Safety Deliberation with Policy Enforcement in LLM Guardrails

About

Reasoning-based LLM guardrails improve safety moderation by generating explicit rationales before issuing final decisions. However, their rationales do not always lead to faithful enforcement: a model may recognize harmful intent in its reasoning but still predict a safe label, or issue an unsafe decision without policy-grounded justification. We identify this safety-critical failure mode as the deliberation-to-enforcement gap. Unlike general chain-of-thought faithfulness, guardrail reliability requires policy execution consistency: the generated reasoning should be grounded in the safety policy, and the final decision should be entailed by that reasoning. We propose ConsisGuard, a consistency-aware framework for reasoning-based LLM guardrails. ConsisGuard performs Policy-to-Decision Trajectory Distillation and Functional Coupling Alignment, aligning the internal coupling between safety deliberation and decision enforcement. Experiments on prompt and response harmfulness detection benchmarks show that ConsisGuard improves detection performance while reducing policy execution failures. These results suggest that reliable reasoning-based guardrails require accurate and faithful execution of safety policies.
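The two failure cases named in the abstract can be made concrete with a small check. The sketch below is illustrative only and is not the paper's method or evaluation code: the `GuardrailOutput` fields, the failure names, and the `failure_rate` helper are all hypothetical, and it assumes the rationale has already been parsed into a harm verdict and an optional cited policy clause.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class GuardrailOutput:
    # All fields are hypothetical, chosen only for this illustration.
    rationale_flags_harm: bool   # does the rationale conclude the input is harmful?
    cited_policy: Optional[str]  # policy clause cited in the rationale, if any
    label: str                   # final decision: "safe" or "unsafe"


def execution_failure(out: GuardrailOutput) -> Optional[str]:
    """Return a name for the policy execution failure, or None if consistent.

    Covers only the two cases described in the abstract.
    """
    # Case 1: the reasoning recognizes harm, but the final label is safe.
    if out.rationale_flags_harm and out.label == "safe":
        return "harm_recognized_but_safe_label"
    # Case 2: an unsafe decision with no policy-grounded justification.
    if out.label == "unsafe" and out.cited_policy is None:
        return "unsafe_without_policy_grounding"
    return None


def failure_rate(outputs: List[GuardrailOutput]) -> float:
    """Fraction of outputs that show either failure case."""
    if not outputs:
        return 0.0
    failures = sum(execution_failure(o) is not None for o in outputs)
    return failures / len(outputs)
```

Under these assumptions, a batch of guardrail outputs can be scored with `failure_rate(outputs)` alongside the usual F1 metrics, so that detection accuracy and reasoning-to-decision consistency are reported separately.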

Yan Wang, Zhixuan Chu, Zihao Xue, Zhen Bi, Bingyu Zhu, YueFeng Chen, Zeyu Yang, Jungang Lou, Longtao Huang, Ningyu Zhang, Kui Ren, Hui Xue • 2026

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Response Harmfulness Detection | HarmBench | F1 Score | 98.29 | 100 |
| Response Harmfulness Detection | XSTEST-RESP | Response Harmfulness F1 | 95.33 | 76 |
| Response Harmfulness Detection | Beavertails | F1 Score | 88.3 | 59 |
| Harmfulness Detection | WildGuard | Macro F1 Score | 89.96 | 47 |
| Toxicity Detection | ToxicChat | F1 Score | 0.7905 | 45 |
| Harmfulness Detection | OpenAI Moderation | Macro F1 Score | 74.35 | 45 |
| Prompt Harmfulness Detection | AegisSafety (test) | F1 Score | 91.02 | 41 |
| Response Harmfulness Detection | SafeRLHF | F1 Score | 69.98 | 41 |