AlignGuard: Scalable Safety Alignment for Text-to-Image Generation

About

Text-to-image (T2I) models are widespread, but their limited safety guardrails expose end users to harmful content and potentially allow model misuse. Current safety measures are typically limited to text-based filtering or concept removal strategies, which can remove only a few concepts from a model's generative capabilities. In this work, we introduce AlignGuard, a method for safety alignment of T2I models. We enable the application of Direct Preference Optimization (DPO) for safety purposes in T2I models by synthetically generating a dataset of harmful and safe image-text pairs, which we call CoProV2. Using a custom DPO strategy and this dataset, we train safety experts, in the form of low-rank adaptation (LoRA) matrices, that steer the generation process away from specific safety-related concepts. We then merge the experts into a single LoRA using a novel merging strategy optimized for scaling performance. This expert-based approach scales well, allowing us to remove 7× more harmful concepts from T2I models than baselines. AlignGuard consistently outperforms the state of the art on many benchmarks and establishes new practices for safety alignment in T2I networks. Code and data will be shared at https://safetydpo.github.io/.
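The abstract names two mechanisms without giving their formulas: a DPO objective that prefers safe over harmful images, and the merging of per-concept LoRA experts into one LoRA. The sketch below is a minimal illustration under stated assumptions, not AlignGuard's implementation. It uses the standard Diffusion-DPO loss (Wallace et al.) in place of the paper's custom DPO strategy, and a plain weighted sum of LoRA updates in place of the paper's novel merging strategy. Noise tensors are flattened to Python lists, the per-timestep weighting is folded into `beta`, and all function names are hypothetical.

```python
# Hypothetical sketch: standard Diffusion-DPO loss and naive LoRA merging.
# These are stand-ins, NOT AlignGuard's custom DPO or its merging strategy.
import math
from typing import List, Sequence, Tuple

Matrix = List[List[float]]


def squared_error(target: Sequence[float], pred: Sequence[float]) -> float:
    """Sum of squared differences between the true noise and a prediction."""
    return sum((t - p) ** 2 for t, p in zip(target, pred))


def neg_log_sigmoid(z: float) -> float:
    """Numerically stable -log(sigmoid(z)), i.e. softplus(-z)."""
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))


def safety_dpo_loss(noise_safe: Sequence[float], pred_safe: Sequence[float],
                    ref_safe: Sequence[float], noise_harm: Sequence[float],
                    pred_harm: Sequence[float], ref_harm: Sequence[float],
                    beta: float = 1.0) -> float:
    """Diffusion-DPO style loss for one (safe, harmful) pair at one timestep.

    The safe image is the preferred sample and the harmful image the rejected
    one. pred_* come from the model being trained (e.g. base + LoRA expert),
    ref_* from the frozen reference model. Timestep weighting is folded into beta.
    """
    # Improvement of the trained model over the reference on each sample
    # (negative means it denoises better than the reference).
    safe_gap = squared_error(noise_safe, pred_safe) - squared_error(noise_safe, ref_safe)
    harm_gap = squared_error(noise_harm, pred_harm) - squared_error(noise_harm, ref_harm)
    # Lower loss when the model improves more on the safe image than on the harmful one.
    return neg_log_sigmoid(-beta * (safe_gap - harm_gap))


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain nested-list matrix product a @ b."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]


def merge_lora_experts(experts: Sequence[Tuple[Matrix, Matrix]],
                       weights: Sequence[float]) -> Matrix:
    """Combine LoRA experts into one dense update: sum_i w_i * (B_i @ A_i).

    Each expert is a (B, A) pair with B of shape (out, r) and A of shape (r, in).
    This is naive weighted averaging, used here only as a baseline.
    """
    deltas = [matmul(b, a) for b, a in experts]
    rows, cols = len(deltas[0]), len(deltas[0][0])
    return [[sum(w * d[i][j] for w, d in zip(weights, deltas)) for j in range(cols)]
            for i in range(rows)]


# Toy usage with made-up values: two rank-1 experts acting on a 2x2 weight.
expert_a = ([[1.0], [0.0]], [[1.0, 2.0]])
expert_b = ([[0.0], [1.0]], [[3.0, 4.0]])
print(merge_lora_experts([expert_a, expert_b], [0.5, 0.5]))  # [[0.5, 1.0], [1.5, 2.0]]
```

The merged update would be added to the frozen base weight. A naive weighted sum like this is the simplest baseline; the abstract attributes AlignGuard's ability to scale to many concepts to its own merging strategy, which this sketch does not reproduce.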

Runtao Liu, I Chieh Chen, Jindong Gu, Jipeng Zhang, Renjie Pi, Qifeng Chen, Philip Torr, Ashkan Khakzar, Fabio Pizzati (2024)

Related benchmarks

Task	Dataset	Metric	Result
Text-to-Image Generation	COCO	FID	49.64	79
Text-to-Image Generation	COCO 30k	FID	20.8	63
Nudity Detection	I2P	Breast (F) Detections	88	29
Safe Text-to-Image Generation	I2P	Inappropriate Probability	8	23
Safe Text-to-Image Generation	CoPro V2 (test)	IP	12	23
Safe Text-to-Image Generation	Unsafe Diffusion (UD)	IP Score	17	23
Safe Text-to-Image Generation	COCO 3K	FID	37.54	23
Broad-concept removal	I2P	Self-harm Removal Rate	33.33	22
Text-to-Image Safety	CoPro v2	Harmful Rate	4.2	18
Text-to-Image Safety	T2VSafetyBench	Harmful Rate	0.24	18
