
Disentangled Safety Adapters Enable Efficient Guardrails and Flexible Inference-Time Alignment

About

Existing paradigms for ensuring AI safety, such as guardrail models and alignment training, often compromise either inference efficiency or development flexibility. We introduce Disentangled Safety Adapters (DSA), a novel framework addressing these challenges by decoupling safety-specific computations from a task-optimized base model. DSA utilizes lightweight adapters that leverage the base model's internal representations, enabling diverse and flexible safety functionalities with minimal impact on inference cost. Empirically, DSA-based safety guardrails substantially outperform comparably sized standalone models on hate speech classification, detection of unsafe model inputs and responses, and hallucination detection, with relative improvements of up to 53% in AUC. Furthermore, DSA-based safety alignment allows dynamic, inference-time adjustment of alignment strength and a fine-grained trade-off between instruction-following performance and model safety. Importantly, combining the DSA safety guardrail with DSA safety alignment facilitates context-dependent alignment strength, boosting safety on StrongREJECT by 93% while maintaining 98% performance on MT-Bench, a total reduction in alignment tax of 8 percentage points compared to standard safety alignment fine-tuning. Overall, DSA presents a promising path towards more modular, efficient, and adaptable AI safety and alignment.
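The abstract does not spell out how the adapters are built or how alignment strength is adjusted at inference time. As a rough illustration only, the sketch below assumes a linear probe on a base-model hidden state as the guardrail and a simple interpolation between base and safety-aligned logits as the strength control. The names `guardrail_score`, `blend_logits`, and `context_dependent_logits` are hypothetical and not taken from the paper.

```python
import math
from typing import List


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def guardrail_score(hidden: List[float], weights: List[float], bias: float) -> float:
    """Assumed guardrail: a linear probe on a hidden state, returning P(unsafe)."""
    return sigmoid(sum(h * w for h, w in zip(hidden, weights)) + bias)


def blend_logits(base: List[float], aligned: List[float], strength: float) -> List[float]:
    """Assumed strength control: strength=0 keeps the base logits,
    strength=1 uses the safety-aligned logits, values in between interpolate."""
    if not 0.0 <= strength <= 1.0:
        raise ValueError("strength must be in [0, 1]")
    return [(1.0 - strength) * b + strength * a for b, a in zip(base, aligned)]


def context_dependent_logits(
    hidden: List[float],
    weights: List[float],
    bias: float,
    base: List[float],
    aligned: List[float],
) -> List[float]:
    """Use the guardrail score as the alignment strength, so inputs that
    look unsafe are pushed further towards the aligned distribution."""
    strength = guardrail_score(hidden, weights, bias)
    return blend_logits(base, aligned, strength)
```

Under these assumptions, a benign input (guardrail score near 0) leaves the base model's output nearly unchanged, which is one way to read the reduced alignment tax the abstract reports.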

Kundan Krishna, Joseph Y Cheng, Charles Maalouf, Leon A Gatys • 2025

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| General Knowledge Evaluation | MMLU | MMLU Accuracy: 50.15 | 127 |
| Multi-turn Conversation Evaluation | MT-Bench | MT-Bench Score: 7.07 | 68 |
| Hate Speech Classification | ToxiGen (test) | AUC: 99 | 24 |
| Safety Classification | AEGIS 2.0 (test) | AUC: 94 | 24 |
| Hallucination Detection | Summedits (test) | AUC: 90 | 24 |
| Safety Classification | BeaverTails (test) | AUC: 93 | 24 |
| Safety Alignment | StrongREJECT | -- | 18 |
| Safety Alignment (Jailbreak Resistance) | JailbreakBench | JailbreakBench Score: 74 | 4 |
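Several of the results above are reported as AUC. For readers unfamiliar with the metric, here is a minimal sketch of ROC AUC computed as the probability that a randomly chosen positive example is scored above a randomly chosen negative one, with ties counted as one half. The function name `roc_auc` is illustrative, and the listing does not say how its AUC values were computed. The function returns a value in [0, 1], while the figures above appear to be on a 0 to 100 scale, which is an assumption.

```python
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Pairwise (Mann-Whitney) estimate of ROC AUC for binary labels in {0, 1}."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # tie counts as half
    return wins / (len(pos) * len(neg))
```

For example, `roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])` evaluates to 0.75, since three of the four positive/negative pairs are ordered correctly.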
