
Palette: A Modular, Controllable, and Efficient Framework for On-demand Authorized Safety Alignment Relaxation in LLMs

About

Current safety alignment of foundation models largely follows a one-size-fits-all paradigm, applying the same refusal policy across users and contexts. As a result, models may refuse requests that are unsafe for general users but legitimate for authorized professionals, limiting helpfulness in specialized professional settings. Existing approaches either require costly realignment or rely on inference-time steering that suffers from imprecise control and added latency. To this end, we propose Palette, a modular, controllable, and efficient framework that selectively relaxes refusal behavior on authorized target domains while preserving standard safety elsewhere. Our method identifies a refusal direction via multi-objective search and internalizes it into the model through lightweight adaptation. Palette further supports modular composition: it learns domain-specific safety controls independently and composes them through parameter merging, enabling on-demand multi-domain authorization without retraining. Experiments across four safety benchmarks, multiple model variants, and both LLMs and VLMs show that Palette delivers precise safety control without sacrificing general utility, offering a practical path toward foundation models that adapt to diverse professional needs.
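The abstract names three mechanisms: a refusal direction in activation space, internalizing an edit into the weights instead of steering at inference time, and composing per-domain edits by parameter merging. The sketch below illustrates only the linear algebra behind each idea under simplifying assumptions; it is not Palette's method. It swaps in a difference-in-means direction (a common baseline) for the paper's multi-objective search, uses a global weight orthogonalization rather than the paper's domain-selective lightweight adaptation, and composes edits by plain summation of weight deltas (task-arithmetic style). All function names are hypothetical, and activations and weights are toy lists of floats.

```python
from typing import List
import math

Vec = List[float]
Mat = List[List[float]]  # row-major, shape (d_out, d_in)


def dot(a: Vec, b: Vec) -> float:
    return sum(x * y for x, y in zip(a, b))


def mean(vectors: List[Vec]) -> Vec:
    # Element-wise mean over a batch of activation vectors.
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def normalize(v: Vec) -> Vec:
    norm = math.sqrt(dot(v, v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def refusal_direction(refused: List[Vec], complied: List[Vec]) -> Vec:
    # Difference-in-means baseline (an assumption, not Palette's search):
    # unit vector from the mean activation on complied prompts to the mean
    # activation on refused prompts.
    mu_r, mu_c = mean(refused), mean(complied)
    return normalize([a - b for a, b in zip(mu_r, mu_c)])


def ablate(h: Vec, r: Vec) -> Vec:
    # Inference-time directional ablation for a unit vector r: h - (r . h) r.
    c = dot(r, h)
    return [x - c * y for x, y in zip(h, r)]


def orthogonalize_delta(W: Mat, r: Vec) -> Mat:
    # Weight delta dW = -r (r^T W), so (W + dW) x == ablate(W x, r) for all x:
    # the edit lives in the weights and adds no inference-time computation.
    # Caveat: this affects every input, unlike Palette's domain-selective edit.
    d_out, d_in = len(W), len(W[0])
    rtW = [sum(r[i] * W[i][j] for i in range(d_out)) for j in range(d_in)]
    return [[-r[i] * rtW[j] for j in range(d_in)] for i in range(d_out)]


def merge(W: Mat, deltas: List[Mat]) -> Mat:
    # Compose independently computed per-domain deltas by summation
    # (a hypothetical stand-in for Palette's parameter merging).
    out = [row[:] for row in W]  # copy so W is left unchanged
    for dW in deltas:
        for i, row in enumerate(dW):
            for j, v in enumerate(row):
                out[i][j] += v
    return out


def matvec(W: Mat, x: Vec) -> Vec:
    return [dot(row, x) for row in W]


if __name__ == "__main__":
    # Toy activations: refused and complied prompts differ only along axis 0.
    refused = [[2.0, 1.0, 0.0], [2.0, 3.0, 0.0]]
    complied = [[0.0, 1.0, 0.0], [0.0, 3.0, 0.0]]
    r = refusal_direction(refused, complied)  # [1.0, 0.0, 0.0]

    W = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]  # toy output projection
    x = [1.0, 1.0]
    edited = merge(W, [orthogonalize_delta(W, r)])
    print(matvec(edited, x))        # [0.0, 7.0, 11.0]
    print(ablate(matvec(W, x), r))  # [0.0, 7.0, 11.0]

    # A second, hypothetical "domain" direction orthogonal to r.
    r2 = [0.0, 1.0, 0.0]
    both = merge(W, [orthogonalize_delta(W, r), orthogonalize_delta(W, r2)])
    print(matvec(both, x))          # [0.0, 0.0, 11.0]
```

Because the edited weights produce the same output as ablating at run time, no hook or extra computation is needed during inference, which mirrors the latency motivation in the abstract. Summing the two deltas equals projecting out both directions here only because `r` and `r2` are orthogonal; for non-orthogonal directions the summed edit is not an exact projection.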

Qitao Tan, Xiaoying Song, Arman Akbari, Arash Akbari, Yanzhi Wang, Xiaoming Zhai, Lingzi Hong, Zhen Xiang, Jin Lu, Geng Yuan • 2026

Related benchmarks

Task | Dataset | Metric | Result | Rank
--- | --- | --- | --- | ---
Multimodal Understanding | MMBench | -- | -- | 847
Mathematical Reasoning | GSM8K | Accuracy (Acc) | 74.9 | 337
General Knowledge | MMLU | MMLU General Knowledge Accuracy | 74.8 | 307
Mathematical Reasoning | GSM8K | GSM8K Accuracy (%) | 88.5 | 204
Language Understanding | MMLU | MMLU Accuracy | 70.1 | 132
General Knowledge Evaluation | MMLU | MMLU Accuracy | 68.8 | 127
Safety Evaluation | MM-SafetyBench | -- | -- | 98
Safety Controllability | GenHarm | Violence Refusal Rate | 97.6 | 80
Multimodal Understanding | MMMU | Accuracy (MMMU) | 52 | 52
Safety Controllability | Safety Domains | Refusal Rate (Violence) | 100 | 21
Showing 10 of 15 rows
