
Beyond Static Alignment: Hierarchical Policy Control for LLM Safety via Risk-Aware Chain-of-Thought

About

Large Language Models (LLMs) face a fundamental safety-helpfulness trade-off due to static, one-size-fits-all safety policies that lack runtime controllability, making it difficult to tailor responses to diverse application needs. As a result, models may over-refuse benign requests or under-constrain harmful ones. We present PACT (Prompt-configured Action via Chain-of-Thought), a framework for dynamic safety control through explicit, risk-aware reasoning. PACT operates under a hierarchical policy architecture: a non-overridable global safety policy establishes immutable boundaries for critical risks (e.g., child safety, violent extremism), while user-defined policies can introduce domain-specific (non-global) risk categories and specify label-to-action behaviors to improve utility in real-world deployment settings. The framework decomposes safety decisions into structured Classify→Act paths that route each query to the appropriate action (comply, guide, or reject) and make the decision-making process transparent. Extensive experiments demonstrate that PACT achieves near state-of-the-art safety performance under global policy evaluation while attaining the best controllability under user-specific policy evaluation, effectively mitigating the safety-helpfulness trade-off. We will release the PACT model suite, training data, and evaluation protocols to facilitate reproducible research in controllable safety alignment.
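
The hierarchical routing described in the abstract can be illustrated with a minimal Python sketch. This is not the PACT implementation: in PACT the classification step is performed by the model's own chain-of-thought, whereas here the risk label is simply passed in. The names `Action`, `GLOBAL_POLICY`, `UserPolicy`, and `act`, the category strings, and the choice that every global category maps to reject are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum


class Action(Enum):
    """The three actions named in the abstract."""
    COMPLY = "comply"
    GUIDE = "guide"
    REJECT = "reject"


# Hypothetical global policy. The abstract lists child safety and violent
# extremism as examples of critical risks; mapping them to REJECT is an
# assumption of this sketch.
GLOBAL_POLICY = {
    "child_safety": Action.REJECT,
    "violent_extremism": Action.REJECT,
}


@dataclass
class UserPolicy:
    """Hypothetical user-defined policy: domain labels mapped to actions."""
    label_to_action: dict = field(default_factory=dict)
    default_action: Action = Action.COMPLY


def act(label: str, user_policy: UserPolicy) -> Action:
    """Route an already-classified query label to an action.

    The global policy is consulted first, so a user policy can never
    override it; user-defined labels are resolved afterwards.
    """
    if label in GLOBAL_POLICY:
        return GLOBAL_POLICY[label]
    return user_policy.label_to_action.get(label, user_policy.default_action)


# Usage: the user policy tries (and fails) to relax a global category.
policy = UserPolicy({"medical_advice": Action.GUIDE,
                     "child_safety": Action.COMPLY})
print(act("child_safety", policy).value)
print(act("medical_advice", policy).value)
```

Checking the global table before the user table is what makes the global policy non-overridable in this sketch: a conflicting user entry is never reached.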

Jianfeng Si, Lin Sun, Weihong Lin, Xiangzheng Zhang • 2026

Related benchmarks

Task                     Dataset          Result              Rank
Safety Evaluation        WildJailbreak    --                  53
Safety Evaluation        S-Eval (base)    Safety Score 98.2   9
Safety Evaluation        S-Eval attack    Safety Score 91.9   9
Safety Evaluation        JBB-Behaviors    Safety Score 98.7   9
Runtime Controllability  CoSApien (test)  Allow: G+/S 83.3    5
Runtime Controllability  PACT (test)      Comp G+/S 99.4      5
