
ProGuard: Towards Proactive Multimodal Safeguard

About

The rapid evolution of generative models has led to a continuous emergence of multimodal safety risks, exposing the limitations of existing defense methods. To address these challenges, we propose ProGuard, a vision-language proactive guard that identifies and describes out-of-distribution (OOD) safety risks without the model adjustments required by traditional reactive approaches. We first construct a modality-balanced dataset of 87K samples, each annotated with both a binary safety label and a risk category under a hierarchical multimodal safety taxonomy, effectively mitigating modality bias and ensuring consistent moderation across text, image, and text-image inputs. Based on this dataset, we train our vision-language base model purely through reinforcement learning (RL) to achieve efficient and concise reasoning. To approximate proactive safety scenarios in a controlled setting, we further introduce an OOD safety category inference task and augment the RL objective with a synonym-bank-based similarity reward that encourages the model to generate concise descriptions for unseen unsafe categories. Experimental results show that ProGuard achieves performance comparable to closed-source large models on binary safety classification and substantially outperforms existing open-source guard models on unsafe content categorization. Most notably, ProGuard delivers strong proactive moderation ability, improving OOD risk detection by 52.6% and OOD risk description by 64.8%.
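The synonym-bank-based similarity reward can be pictured as scoring a generated risk description against a bank of accepted phrasings for the ground-truth category. The abstract does not give the exact similarity function, so the sketch below is a minimal stdlib approximation: it uses `difflib.SequenceMatcher` string similarity in place of whatever matcher the paper uses, and the function name, threshold, and example synonym bank are all hypothetical.

```python
from difflib import SequenceMatcher

def synonym_bank_reward(prediction: str, synonym_bank: list[str],
                        threshold: float = 0.5) -> float:
    """Hypothetical sketch of a synonym-bank similarity reward.

    The reward is the best similarity between the model's predicted
    risk description and any synonym accepted for the ground-truth
    category; matches below `threshold` earn no reward, discouraging
    vague or off-topic descriptions.
    """
    best = max(
        SequenceMatcher(None, prediction.lower(), syn.lower()).ratio()
        for syn in synonym_bank
    )
    return best if best >= threshold else 0.0

# Hypothetical synonym bank for an unseen "weapon crafting" risk category.
bank = ["weapon fabrication", "arms manufacturing", "weapon crafting"]
print(synonym_bank_reward("weapon crafting guide", bank))  # high reward
print(synonym_bank_reward("cooking recipe", bank))         # no reward
```

In an RL setup this scalar would be added to the task reward, so concise descriptions close to any accepted synonym are reinforced even for categories never seen in training.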

Shaohan Yu, Lijun Li, Chenyang Si, Lu Sheng, Jing Shao • 2025

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Response Classification | BeaverTails-V Text-Image Response | F1 Score | 83.1 | 23 |
| Response Classification | Aegis Text Response 2.0 | F1 Score | 82.27 | 16 |
| Response Classification | WildGuard Text Response | F1 Score | 92.92 | 16 |
| Response Classification | XSTest Text Response | F1 Score | 95.94 | 16 |
| Prompt Classification | Aegis Text Prompt 2.0 | F1 Score | 83.38 | 14 |
| Prompt Classification | ToxicChat Text Prompt | F1 Score | 96.07 | 14 |
| Prompt Classification | WildGuard Text Prompt | F1 Score | 86.44 | 14 |
| Prompt Classification | SimpleSafetyTests Text Prompt | F1 Score | 99.5 | 14 |
| Prompt Classification | XSTest Text Prompt | F1 Score | 88.25 | 14 |
| Prompt Classification | OpenAI Moderation Text Prompt | F1 Score | 83.05 | 14 |

Showing 10 of 33 rows.

Other info

GitHub
