LlavaGuard: An Open VLM-based Framework for Safeguarding Vision Datasets and Models
About
This paper introduces LlavaGuard, a suite of VLM-based vision safeguards that address the critical need for reliable guardrails in the era of large-scale data and models. To this end, we establish a novel open framework, describing a customizable safety taxonomy, data preprocessing, augmentation, and training setup. To teach a VLM safeguard about safety, we further create a multimodal safety dataset with high-quality human expert annotations, where each image is labeled with a safety rating, category, and rationale. We also employ advanced augmentations to support context-specific assessments. The resulting LlavaGuard models, ranging from 0.5B to 7B parameters, serve as a versatile tool for evaluating the safety compliance of visual content against flexible policies. In comprehensive experiments, LlavaGuard outperforms both state-of-the-art safeguards and VLMs in accuracy and in flexibly handling different policies. Additionally, we demonstrate LlavaGuard's performance in two real-world applications: large-scale dataset annotation and moderation of text-to-image models. We make our entire framework publicly available, including the dataset, model weights, and training code.
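The abstract states that each image is labeled with a safety rating, a category, and a rationale. As a rough illustration of what such a record could look like in code, the sketch below parses one assessment from JSON. The class name `SafetyAssessment`, the helper names, the JSON keys, and the label strings are all assumptions made for this example and are not taken from the LlavaGuard release.

```python
import json
from dataclasses import dataclass


@dataclass
class SafetyAssessment:
    """One safety annotation for an image (field names are hypothetical)."""
    rating: str     # e.g. "Safe" or "Unsafe"; the label strings are assumed
    category: str   # name of a category from the safety taxonomy
    rationale: str  # free-text explanation for the rating


def parse_assessment(raw: str) -> SafetyAssessment:
    """Parse a JSON string with 'rating', 'category' and 'rationale' keys."""
    data = json.loads(raw)
    return SafetyAssessment(
        rating=data["rating"],
        category=data["category"],
        rationale=data["rationale"],
    )


def is_flagged(assessment: SafetyAssessment) -> bool:
    """Return True when the rating marks the image as unsafe."""
    return assessment.rating.strip().lower() == "unsafe"


# Illustrative input only; not an actual dataset entry.
example = (
    '{"rating": "Unsafe", "category": "Example category", '
    '"rationale": "Example rationale."}'
)
record = parse_assessment(example)
```

Keeping the rationale next to the rating and category lets a downstream reviewer see why an image was flagged rather than only that it was.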
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Safety Evaluation | MM-SafetyBench | Average ASR | 32.58 | 98 |
| Safety Evaluation | VLGuard | ASR | 90.42 | 27 |
| Safety Evaluation | JailBreakV | ASR | 90.71 | 27 |
| Safety Evaluation | UnsafeBench | F1 Score | 63.4 | 24 |
| Unsafe content detection | LlavaGuard | Accuracy | 82 | 14 |
| Video Moderation | SafeWatch-GenAI (test) | Sexual Accuracy | 96.6 | 14 |
| Content Moderation | SafeWatch-Real (val) | Sexual Accuracy | 88.2 | 14 |
| Binary Safety Classification | UnsafeBench | Sexual | 51.2 | 13 |
| Safety Evaluation | SMID (test) | F1 Score | 66.6 | 11 |
| Safety Classification | SafeEditBench | Policy L1 Success Rate | 49.15 | 11 |