LlavaGuard: An Open VLM-based Framework for Safeguarding Vision Datasets and Models
About
This paper introduces LlavaGuard, a suite of VLM-based vision safeguards that addresses the critical need for reliable guardrails in the era of large-scale data and models. To this end, we establish a novel open framework comprising a customizable safety taxonomy, data preprocessing, augmentation, and a training setup. To teach VLM safeguards about safety, we further create a multimodal safety dataset with high-quality human expert annotations, in which each image is labeled with a safety rating, category, and rationale. We also employ advanced augmentations to support context-specific assessments. The resulting LlavaGuard models, ranging from 0.5B to 7B parameters, serve as versatile tools for evaluating the safety compliance of visual content against flexible policies. In comprehensive experiments, LlavaGuard outperforms both state-of-the-art safeguards and VLMs in accuracy and in flexibly handling different policies. Additionally, we demonstrate LlavaGuard's performance in two real-world applications: large-scale dataset annotation and moderation of text-to-image models. We make our entire framework publicly available, including the dataset, model weights, and training code.
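Since LlavaGuard assesses an image against a safety policy supplied at inference time, a typical invocation pairs an image with a policy prompt and parses the returned rating, category, and rationale. Below is a minimal inference sketch assuming the released checkpoints follow the standard LLaVA chat interface in Hugging Face Transformers; the model ID, policy text, and prompt template are illustrative assumptions, not verbatim from the paper.

```python
# Minimal sketch: querying a LlavaGuard checkpoint for a policy-based safety
# assessment of a single image. Model ID and policy wording are hypothetical.
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

MODEL_ID = "AIML-TUDA/llava-guard-7b"  # hypothetical HF identifier

processor = AutoProcessor.from_pretrained(MODEL_ID)
model = LlavaForConditionalGeneration.from_pretrained(
    MODEL_ID, torch_dtype=torch.float16, device_map="auto"
)

# A (truncated) safety policy; the framework's key idea is that editing this
# text changes the assessment criteria without retraining the model.
policy = (
    "Assess the image against the following safety categories: "
    "O1 Hate & Harassment, O2 Violence, O3 Sexual Content, ... "
    "Return a safety rating, the matching category, and a rationale."
)

image = Image.open("example.jpg")
prompt = f"USER: <image>\n{policy}\nASSISTANT:"

inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=256, do_sample=False)
print(processor.decode(output[0], skip_special_tokens=True))
```

In this setup, policy flexibility comes entirely from the prompt: tightening or relaxing a category in the policy string changes the verdict criteria, which is what the experiments on handling different policies evaluate.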
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Safety Evaluation | UnsafeBench | F1 Score | 63.4 | 24 |
| Unsafe content detection | LlavaGuard | Accuracy | 82 | 14 |
| Safety Evaluation | SMID (test) | F1 Score | 66.6 | 11 |
| Safety Evaluation | UnsafeDiff (test) | F1 Score | 53 | 11 |
| Safety Evaluation | UnsafeBench (test) | F1 Score | 53.7 | 11 |
| Severity-wise Harmfulness Classification | BLM-Guard | Accuracy (High) | 73.3 | 9 |
| Binary Harmfulness Detection | BLM-Guard | NR (B) | 82.3 | 9 |
| Jailbreak Detection | FigStep | AUROC | 0.836 | 9 |
| Unsafe content detection | VLGuard | F1 Score | 69.8 | 9 |
| Jailbreak Detection | JailBreakV | AUROC | 84.26 | 9 |