ProGuard: Towards Proactive Multimodal Safeguard
About
The rapid evolution of generative models has led to a continuous emergence of multimodal safety risks, exposing the limitations of existing defense methods. To address these challenges, we propose ProGuard, a vision-language proactive guard that identifies and describes out-of-distribution (OOD) safety risks without the model adjustments that traditional reactive approaches require. We first construct a modality-balanced dataset of 87K samples, each annotated with both a binary safety label and risk categories under a hierarchical multimodal safety taxonomy, which mitigates modality bias and ensures consistent moderation across text, image, and text-image inputs. Based on this dataset, we train our vision-language base model purely through reinforcement learning (RL) to achieve efficient and concise reasoning. To approximate proactive safety scenarios in a controlled setting, we further introduce an OOD safety category inference task and augment the RL objective with a synonym-bank-based similarity reward that encourages the model to generate concise descriptions for unseen unsafe categories. Experimental results show that ProGuard achieves performance comparable to closed-source large models on binary safety classification and substantially outperforms existing open-source guard models on unsafe content categorization. Most notably, ProGuard delivers strong proactive moderation ability, improving OOD risk detection by 52.6% and OOD risk description by 64.8%.
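The abstract does not specify how the synonym-bank-based similarity reward is computed, so the following is only a minimal sketch of the general idea: score a generated category description against every synonym of the ground-truth category and keep the best match. The bank contents, the function names (`jaccard`, `similarity_reward`), and the use of token-level Jaccard overlap as the similarity measure are all assumptions made for illustration; the actual method may well use a different measure, such as embedding similarity.

```python
import re

# Hypothetical synonym bank: each ground-truth risk category maps to
# alternative phrasings that should also count as a correct description.
SYNONYM_BANK = {
    "weapons": ["weapons", "firearms and weapons", "arms trafficking"],
    "self-harm": ["self-harm", "self injury", "suicide and self-harm"],
}


def _tokens(text):
    """Lowercase the text and split it into a set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a, b):
    """Token-level Jaccard overlap in [0, 1] (an assumed stand-in similarity)."""
    ta, tb = _tokens(a), _tokens(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def similarity_reward(prediction, true_category, bank=SYNONYM_BANK):
    """Reward = best similarity between the prediction and any synonym.

    Categories missing from the bank fall back to the category name itself.
    """
    references = bank.get(true_category, [true_category])
    return max((jaccard(prediction, ref) for ref in references), default=0.0)
```

Taking the maximum over synonyms means a description is not penalized for choosing a valid phrasing that differs from the canonical category name, which is the property a synonym bank is meant to provide. Because Jaccard overlap shrinks as extra tokens are added, this particular stand-in also mildly favors short descriptions.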
Related benchmarks
| Task | Dataset | F1 Score | Rank |
|---|---|---|---|
| Response Classification | BeaverTails V Text-Image Response | 83.1 | 23 |
| Response Classification | Aegis Text Response 2.0 | 82.27 | 16 |
| Response Classification | WildGuard Text Response | 92.92 | 16 |
| Response Classification | XSTest Text Response | 95.94 | 16 |
| Prompt Classification | Aegis Text Prompt 2.0 | 83.38 | 14 |
| Prompt Classification | ToxicChat Text Prompt | 96.07 | 14 |
| Prompt Classification | WildGuard Text Prompt | 86.44 | 14 |
| Prompt Classification | Simple SafetyTest Text Prompt | 99.5 | 14 |
| Prompt Classification | XSTest Text Prompt | 88.25 | 14 |
| Prompt Classification | OpenAI Moderation Text Prompt | 83.05 | 14 |