MOSAIC: Composable Safety Alignment with Modular Control Tokens
About
Safety alignment in large language models (LLMs) is commonly implemented as a single static policy embedded in model parameters. However, real-world deployments often require context-dependent safety rules that vary across users, regions, and applications. Existing approaches struggle to provide such conditional control: parameter-level alignment entangles safety behaviors with general capabilities, while prompt-based methods rely on natural language instructions that provide weak enforcement. We propose MOSAIC, a modular framework that enables compositional safety alignment through learnable control tokens optimized over a frozen backbone model. Each token represents a safety constraint and can be flexibly activated and composed at inference time. To train compositional tokens efficiently, we introduce order-based task sampling and a distribution-level alignment objective that mitigates over-refusal. Experiments show that MOSAIC achieves strong defense performance with substantially lower over-refusal while preserving model utility.
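The composition idea described above can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that each control token is a single learnable embedding vector prepended to the prompt embeddings of a frozen model, and that the "order" of a task is the number of constraints active at once. The constraint names, `EMBED_DIM`, `compose`, and `sample_task` are all hypothetical, and the uniform sampling is only a stand-in for the order-based task sampling the abstract mentions.

```python
import numpy as np

rng = np.random.default_rng(0)
EMBED_DIM = 8  # hypothetical embedding width; a real backbone would be far larger

# Hypothetical constraint names. Each maps to one learnable vector of shape
# (1, EMBED_DIM); in training only these vectors would be updated while the
# backbone stays frozen (assumption based on the abstract).
control_tokens = {
    name: rng.normal(size=(1, EMBED_DIM))
    for name in ("no_medical_advice", "no_weapons", "no_gambling", "no_adult_content")
}


def compose(active, prompt_embeddings):
    """Prepend the embeddings of the active control tokens to the prompt."""
    if not active:
        return prompt_embeddings
    prefix = np.concatenate([control_tokens[name] for name in active], axis=0)
    return np.concatenate([prefix, prompt_embeddings], axis=0)


def sample_task(order, rng):
    """Draw `order` distinct constraints uniformly at random (illustrative only)."""
    names = sorted(control_tokens)
    idx = rng.choice(len(names), size=order, replace=False)
    return [names[i] for i in idx]
```

Under these assumptions, activating two constraints on a five-token prompt yields a seven-row input sequence, and switching a constraint off only means leaving its vector out of the prefix.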
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Safety Alignment | Safety Alignment Dataset 1-order (test) | DSR | 100 | 10 |
| Safety Alignment | Safety Alignment Dataset 2-order (test) | DSR | 99.8 | 10 |
| Safety Alignment | Safety Alignment Dataset 3-order (test) | DSR | 100 | 10 |
| Safety Alignment | Safety Alignment Dataset 4-order (test) | DSR | 100 | 10 |
| Language Understanding | MMLU 1-order | Accuracy | 55.1 | 10 |
| Language Understanding | MMLU 2-order | Accuracy | 55.2 | 10 |
| Language Understanding | MMLU 3-order | Accuracy | 55.1 | 10 |