Beyond Static Alignment: Hierarchical Policy Control for LLM Safety via Risk-Aware Chain-of-Thought

About

Large Language Models (LLMs) face a fundamental safety-helpfulness trade-off because static, one-size-fits-all safety policies lack runtime controllability, making it difficult to tailor responses to diverse application needs. As a result, models may over-refuse benign requests or under-constrain harmful ones. We present PACT (Prompt-configured Action via Chain-of-Thought), a framework for dynamic safety control through explicit, risk-aware reasoning. PACT operates under a hierarchical policy architecture: a non-overridable global safety policy establishes immutable boundaries for critical risks (e.g., child safety, violent extremism), while user-defined policies can introduce domain-specific (non-global) risk categories and specify label-to-action behaviors to improve utility in real-world deployment settings. The framework decomposes safety decisions into structured Classify→Act paths that route each query to the appropriate action (comply, guide, or reject) and make the decision process transparent. Extensive experiments demonstrate that PACT achieves near state-of-the-art safety performance under global policy evaluation while attaining the best controllability under user-specific policy evaluation, effectively mitigating the safety-helpfulness trade-off. We will release the PACT model suite, training data, and evaluation protocols to facilitate reproducible research in controllable safety alignment.
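
The hierarchical Classify→Act routing described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the label names, the `GLOBAL_POLICY` table, the choice of `reject` for global categories, the default action for unlabeled queries, and the keyword-based `classify` stand-in (PACT itself relies on risk-aware chain-of-thought reasoning by the model) are all assumptions made for illustration.

```python
from enum import Enum


class Action(Enum):
    COMPLY = "comply"
    GUIDE = "guide"
    REJECT = "reject"


# Hypothetical global policy: these labels and their fixed action are
# illustrative assumptions, not taken from the paper.
GLOBAL_POLICY = {
    "child_safety": Action.REJECT,
    "violent_extremism": Action.REJECT,
}


def classify(query, keyword_labels):
    """Toy Classify step: a keyword match standing in for model reasoning."""
    lowered = query.lower()
    for keyword, label in keyword_labels.items():
        if keyword in lowered:
            return label
    return "benign"


def resolve_action(label, user_policy, default=Action.COMPLY):
    """Act step: the global policy always wins, then the user policy,
    then an assumed default for labels neither policy mentions."""
    if label in GLOBAL_POLICY:
        return GLOBAL_POLICY[label]
    return user_policy.get(label, default)


# A user-defined policy adds a domain-specific category and also tries
# (unsuccessfully) to relax a global one.
user_policy = {
    "medication_dosage": Action.GUIDE,
    "child_safety": Action.COMPLY,  # ignored: global labels are not overridable
}
keyword_labels = {"dosage": "medication_dosage"}

label = classify("What is the maximum dosage of ibuprofen?", keyword_labels)
print(label, resolve_action(label, user_policy).value)
```

Checking the global table before the user table is what makes the global layer non-overridable in this sketch: a user policy can add categories or change actions for its own labels, but any entry that collides with a global label is never consulted.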

Jianfeng Si, Lin Sun, Weihong Lin, Xiangzheng Zhang • 2026

Related benchmarks

Task	Dataset	Result
Safety Evaluation	WildJailbreak	--	70
Safety Evaluation	S-Eval (base)	Safety Score: 98.2	9
Safety Evaluation	S-Eval attack	Safety Score: 91.9	9
Safety Evaluation	JBB-Behaviors	Safety Score: 98.7	9
Runtime Controllability	CoSApien (test)	Allow: G+/S 83.3	5
Runtime Controllability	PACT (test)	Comp G+/S 99.4	5
