ConsisGuard: Aligning Safety Deliberation with Policy Enforcement in LLM Guardrails
About
Reasoning-based LLM guardrails improve safety moderation by generating explicit rationales before issuing final decisions. However, their rationales do not always lead to faithful enforcement: a model may recognize harmful intent in its reasoning but still predict a safe label, or issue an unsafe decision without policy-grounded justification. We identify this safety-critical failure mode as the deliberation-to-enforcement gap. Unlike general chain-of-thought faithfulness, guardrail reliability requires policy execution consistency: the generated reasoning should be grounded in the safety policy, and the final decision should be entailed by that reasoning. We propose ConsisGuard, a consistency-aware framework for reasoning-based LLM guardrails. ConsisGuard performs Policy-to-Decision Trajectory Distillation and Functional Coupling Alignment, aligning the internal coupling between safety deliberation and decision enforcement. Experiments on prompt and response harmfulness detection benchmarks show that ConsisGuard improves detection performance while reducing policy execution failures. These results suggest that reliable reasoning-based guardrails require accurate and faithful execution of safety policies.
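To make the two failure cases named above concrete, the following is a minimal illustrative sketch, not the ConsisGuard method. It assumes a guardrail's output has already been parsed into three hypothetical fields: the verdict its rationale reaches (`rationale_verdict`), the policy clause the rationale cites, if any (`cited_policy`), and the final label (`label`). All names here are assumptions introduced for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class GuardrailOutput:
    # Hypothetical parsed fields; a real guardrail emits free text that
    # would first need to be parsed into this form.
    rationale_verdict: str       # "safe" or "unsafe", as concluded in the rationale
    cited_policy: Optional[str]  # policy clause cited in the rationale, or None
    label: str                   # final decision: "safe" or "unsafe"


def execution_failure(out: GuardrailOutput) -> Optional[str]:
    """Return a failure tag, or None if the output is consistent."""
    # Case 1: the final decision is not entailed by the rationale,
    # e.g. the rationale recognizes harm but the label is "safe".
    if out.rationale_verdict != out.label:
        return "decision_not_entailed"
    # Case 2: an unsafe decision with no policy-grounded justification.
    if out.label == "unsafe" and out.cited_policy is None:
        return "ungrounded_decision"
    return None


def failure_rate(outputs: List[GuardrailOutput]) -> float:
    """Fraction of outputs that show either failure case."""
    if not outputs:
        return 0.0
    failures = sum(1 for o in outputs if execution_failure(o) is not None)
    return failures / len(outputs)
```

A check like this only flags surface-level inconsistency under the stated assumptions; it says nothing about how the paper measures policy execution failures.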
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Response Harmfulness Detection | HarmBench | F1 Score | 98.29 | 100 |
| Response Harmfulness Detection | XSTEST-RESP | Response Harmfulness F1 | 95.33 | 76 |
| Response Harmfulness Detection | Beavertails | F1 Score | 88.3 | 59 |
| Harmfulness Detection | WildGuard | Macro F1 Score | 89.96 | 47 |
| Toxicity Detection | ToxicChat | F1 Score | 0.7905 | 45 |
| Harmfulness Detection | OpenAI Moderation | Macro F1 Score | 74.35 | 45 |
| Prompt Harmfulness Detection | AegisSafety (test) | F1 Score | 91.02 | 41 |
| Response Harmfulness Detection | SafeRLHF | F1 Score | 69.98 | 41 |