AlignGuard: Scalable Safety Alignment for Text-to-Image Generation
About
Text-to-image (T2I) models are widespread, but their limited safety guardrails expose end users to harmful content and potentially allow for model misuse. Current safety measures are typically limited to text-based filtering or concept removal strategies, which can remove only a few concepts from the model's generative capabilities. In this work, we introduce AlignGuard, a method for safety alignment of T2I models. We enable the application of Direct Preference Optimization (DPO) for safety purposes in T2I models by synthetically generating a dataset of harmful and safe image-text pairs, which we call CoProV2. Using a custom DPO strategy and this dataset, we train safety experts, in the form of low-rank adaptation (LoRA) matrices, that guide the generation process away from specific safety-related concepts. We then merge the experts into a single LoRA using a novel merging strategy for optimal scaling performance. This expert-based approach enables scalability, allowing us to remove 7x more harmful concepts from T2I models than the baselines. AlignGuard consistently outperforms the state of the art on many benchmarks and establishes new practices for safety alignment in T2I networks. Code and data will be shared at https://safetydpo.github.io/.
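To make the preference-optimization idea concrete, the following is a minimal sketch of the standard DPO loss for a single preference pair, with the safe image treated as the preferred sample and the harmful image as the rejected one. It is not the custom DPO strategy described above, whose details are not given here. The function names, the scalar log-probability inputs, and the value of `beta` are illustrative assumptions. In diffusion models the log-likelihood terms are generally intractable and are usually replaced by a proxy such as the denoising error.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(
    logp_safe: float,         # log-prob of the safe sample under the model being trained
    logp_harmful: float,      # log-prob of the harmful sample under the model being trained
    ref_logp_safe: float,     # log-prob of the safe sample under the frozen reference model
    ref_logp_harmful: float,  # log-prob of the harmful sample under the frozen reference model
    beta: float = 0.1,        # assumed strength of the implicit KL regularization
) -> float:
    """Standard DPO loss for one (safe, harmful) pair (hypothetical helper)."""
    # How much more the trained model prefers the safe sample over the
    # harmful one, relative to the reference model.
    margin = (logp_safe - ref_logp_safe) - (logp_harmful - ref_logp_harmful)
    return -log_sigmoid(beta * margin)


if __name__ == "__main__":
    # The trained model matches the reference: loss equals log(2).
    print(dpo_loss(-2.0, -2.0, -2.0, -2.0))
    # The trained model has shifted probability toward the safe sample: lower loss.
    print(dpo_loss(-1.0, -3.0, -2.0, -2.0))
```

Minimizing this loss raises the relative likelihood of the safe sample and lowers that of the harmful one, while the reference terms keep the trained model from drifting arbitrarily far from its starting point.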
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Text-to-Image Generation | COCO | FID | 49.64 | 51 |
| Safe Text-to-Image Generation | I2P | Inappropriate Probability | 8 | 23 |
| Safe Text-to-Image Generation | CoPro V2 (test) | IP | 12 | 23 |
| Safe Text-to-Image Generation | Unsafe Diffusion (UD) | IP Score | 17 | 23 |
| Safe Text-to-Image Generation | COCO 3K | FID | 37.54 | 23 |
| Text-to-Image Safety | CoPro v2 | Harmful Rate | 4.2 | 18 |
| Text-to-Image Safety | T2VSafetyBench | Harmful Rate | 0.24 | 18 |
| Text-to-Image Safety | UD | Harmful Rate | 16.7 | 18 |
| Text-to-Image Generation | I2P | Harmful Rate | 0.162 | 9 |
| Text-to-Image Safety | I2P | Harmful Rate | 13.7 | 9 |