ConsisGuard: Aligning Safety Deliberation with Policy Enforcement in LLM Guardrails

About

Reasoning-based LLM guardrails improve safety moderation by generating explicit rationales before issuing final decisions. However, their rationales do not always lead to faithful enforcement: a model may recognize harmful intent in its reasoning but still predict a safe label, or issue an unsafe decision without policy-grounded justification. We identify this safety-critical failure mode as the deliberation-to-enforcement gap. Unlike general chain-of-thought faithfulness, guardrail reliability requires policy execution consistency: the generated reasoning should be grounded in the safety policy, and the final decision should be entailed by that reasoning. We propose ConsisGuard, a consistency-aware framework for reasoning-based LLM guardrails. ConsisGuard performs Policy-to-Decision Trajectory Distillation and Functional Coupling Alignment, aligning the internal coupling between safety deliberation and decision enforcement. Experiments on prompt and response harmfulness detection benchmarks show that ConsisGuard improves detection performance while reducing policy execution failures. These results suggest that reliable reasoning-based guardrails require accurate and faithful execution of safety policies.
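The two failure cases named in the abstract can be illustrated with a minimal sketch. It is not the paper's method or evaluation code: the `GuardrailOutput` fields, the category names, and the `gap_rate` helper are all hypothetical, and the sketch assumes each guardrail output has already been annotated with whether its rationale flags harm and whether it cites a policy.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class GuardrailOutput:
    # Hypothetical annotations; not the schema used in the paper.
    reasoning_flags_harm: bool  # the rationale concludes the content is harmful
    cites_policy: bool          # the rationale refers to a safety policy clause
    final_label: str            # "safe" or "unsafe"


def classify_gap(output: GuardrailOutput) -> str:
    """Assign one output to a (hypothetical) consistency category."""
    # Case 1 from the abstract: harm recognized in reasoning, but a safe label.
    if output.reasoning_flags_harm and output.final_label == "safe":
        return "recognized_but_not_enforced"
    # Case 2 from the abstract: unsafe decision without policy grounding.
    if output.final_label == "unsafe" and not output.cites_policy:
        return "ungrounded_enforcement"
    return "consistent"


def gap_rate(outputs: List[GuardrailOutput]) -> float:
    """Fraction of outputs that fall into either failure category."""
    if not outputs:
        return 0.0
    failures = sum(classify_gap(o) != "consistent" for o in outputs)
    return failures / len(outputs)


if __name__ == "__main__":
    sample = [
        GuardrailOutput(True, True, "unsafe"),    # consistent
        GuardrailOutput(True, True, "safe"),      # recognized but not enforced
        GuardrailOutput(False, False, "unsafe"),  # ungrounded enforcement
        GuardrailOutput(False, False, "safe"),    # consistent
    ]
    print([classify_gap(o) for o in sample])
    print(gap_rate(sample))
```

A rate like this only counts surface mismatches between annotated reasoning and the final label; how the paper itself measures policy execution failures is not stated in the abstract.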

Yan Wang, Zhixuan Chu, Zihao Xue, Zhen Bi, Bingyu Zhu, YueFeng Chen, Zeyu Yang, Jungang Lou, Longtao Huang, Ningyu Zhang, Kui Ren, Hui Xue • 2026

Related benchmarks

Task	Dataset	Result
Response Harmfulness Detection	HarmBench	F1 Score: 98.29	100
Response Harmfulness Detection	XSTEST-RESP	Response Harmfulness F1: 95.33	76
Response Harmfulness Detection	Beavertails	F1 Score: 88.3	59
Harmfulness Detection	OpenAI Moderation	Macro F1 Score: 74.35	59
Harmfulness Detection	WildGuard	Macro F1 Score: 89.96	47
Toxicity Detection	ToxicChat	F1 Score: 0.7905	45
Prompt Harmfulness Detection	AegisSafety (test)	F1 Score: 91.02	41
Response Harmfulness Detection	SafeRLHF	F1 Score: 69.98	41
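
For readers comparing the rows above, the following is a minimal sketch of how F1 and a class-averaged macro F1 are commonly computed for a binary safe/unsafe labeling task. The function names and the toy labels are illustrative assumptions; each benchmark may define its reported score differently (for example, averaging over categories or subsets rather than classes). Note also that the ToxicChat value appears to be on a 0 to 1 scale, while the other rows appear to be percentages.

```python
from typing import List


def f1_score(y_true: List[str], y_pred: List[str], positive: str) -> float:
    """F1 for one class, treating `positive` as the positive label."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    if tp == 0:
        # No true positives: precision or recall is zero (or undefined).
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(y_true: List[str], y_pred: List[str]) -> float:
    """Unweighted mean of per-class F1 over all labels that occur."""
    labels = sorted(set(y_true) | set(y_pred))
    return sum(f1_score(y_true, y_pred, label) for label in labels) / len(labels)


if __name__ == "__main__":
    # Toy labels, not taken from any benchmark.
    y_true = ["unsafe", "unsafe", "safe", "safe"]
    y_pred = ["unsafe", "safe", "safe", "safe"]
    print(f1_score(y_true, y_pred, positive="unsafe"))
    print(macro_f1(y_true, y_pred))
```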
