AlignGuard: Scalable Safety Alignment for Text-to-Image Generation

About

Text-to-image (T2I) models are widespread, but their limited safety guardrails expose end users to harmful content and potentially allow for model misuse. Current safety measures are typically limited to text-based filtering or concept removal strategies, able to remove just a few concepts from the model's generative capabilities. In this work, we introduce AlignGuard, a method for safety alignment of T2I models. We enable the application of Direct Preference Optimization (DPO) for safety purposes in T2I models by synthetically generating a dataset of harmful and safe image-text pairs, which we call CoProV2. Using a custom DPO strategy and this dataset, we train safety experts, in the form of low-rank adaptation (LoRA) matrices, able to guide the generation process away from specific safety-related concepts. Then, we merge the experts into a single LoRA using a novel merging strategy for optimal scaling performance. This expert-based approach enables scalability, allowing us to remove 7x more harmful concepts from T2I models compared to baselines. AlignGuard consistently outperforms the state-of-the-art on many benchmarks and establishes new practices for safety alignment in T2I networks. Code and data will be shared at https://safetydpo.github.io/.
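For orientation, the preference objective that DPO optimizes can be sketched in a few lines. This is the standard DPO loss for a single preference pair, not AlignGuard's custom DPO strategy, which the abstract does not specify. The function name `dpo_loss`, its parameter names, and the default `beta=0.1` are illustrative assumptions; the "safe" sample plays the role of the preferred output and the "harmful" sample the rejected one.

```python
import math


def dpo_loss(logp_safe, logp_harmful, ref_logp_safe, ref_logp_harmful, beta=0.1):
    """Standard DPO loss for one (safe, harmful) pair (illustrative sketch).

    Inputs are log-likelihoods of the safe and harmful outputs under the
    model being trained (logp_*) and under a frozen reference model
    (ref_logp_*). All names here are hypothetical, not from the paper.
    """
    # How much more the trained model favors the safe output over the
    # harmful one, relative to the reference model, scaled by beta.
    margin = beta * (
        (logp_safe - ref_logp_safe) - (logp_harmful - ref_logp_harmful)
    )
    # loss = -log(sigmoid(margin)), computed as a numerically stable softplus.
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

When the trained model matches the reference, the margin is zero and the loss equals log 2; the loss falls as the model shifts probability toward the safe output. For diffusion models the log-likelihood terms are usually replaced by denoising-error surrogates, which this sketch does not cover.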

Runtao Liu, I Chieh Chen, Jindong Gu, Jipeng Zhang, Renjie Pi, Qifeng Chen, Philip Torr, Ashkan Khakzar, Fabio Pizzati • 2024

Related benchmarks

Task                           Dataset                Result                       Rank
Text-to-Image Generation       COCO                   FID 49.64                    51
Safe Text-to-Image Generation  I2P                    Inappropriate Probability 8  23
Safe Text-to-Image Generation  CoPro V2 (test)        IP 12                        23
Safe Text-to-Image Generation  Unsafe Diffusion (UD)  IP Score 17                  23
Safe Text-to-Image Generation  COCO 3K                FID 37.54                    23
Text-to-Image Safety           CoPro v2               Harmful Rate 4.2             18
Text-to-Image Safety           T2VSafetyBench         Harmful Rate 0.24            18
Text-to-Image Safety           UD                     Harmful Rate 16.7            18
Text-to-Image Generation       I2P                    Harmful Rate 0.162           9
Text-to-Image Safety           I2P                    Harmful Rate 13.7            9