ShieldGemma 2: Robust and Tractable Image Content Moderation

About

We introduce ShieldGemma 2, a 4B-parameter image content moderation model built on Gemma 3. The model provides robust safety risk predictions across the following key harm categories: Sexually Explicit, Violence & Gore, and Dangerous Content, for both synthetic images (e.g., the output of any image generation model) and natural images (e.g., any image input to a vision-language model). We evaluated it on both internal and external benchmarks, demonstrating state-of-the-art performance relative to LlavaGuard \citep{helff2024llavaguard}, GPT-4o mini \citep{hurst2024gpt}, and the base Gemma 3 model \citep{gemma_2025} under our policies. Additionally, we present a novel adversarial data generation pipeline that enables controlled, diverse, and robust image generation. ShieldGemma 2 provides an open image moderation tool to advance multimodal safety and responsible AI development.

Wenjun Zeng, Dana Kurniawan, Ryan Mullins, Yuchi Liu, Tamoghna Saha, Dirichi Ike-Njoku, Jindong Gu, Yiwen Song, Cai Xu, Jingjing Zhou, Aparna Joshi, Shravan Dheep, Mani Malek, Hamid Palangi, Joon Baek, Rick Pereira, Karthik Narasimhan• 2025

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Tag Detection | SenBen MECD tags 1.0 (test) | F1 Tag | 8.9 | 11 |
| Content Moderation | UnsafeBench Sexual category (test) | Accuracy | 64.8 | 8 |
| Jailbreak Defense | Safety Guardrail Evaluation Set | Char Noise Robustness | 24 | 6 |
| Multimodal Content Moderation | UnsafeBench Sexual Text-Only | Accuracy | 59.09 | 3 |
| Multimodal Content Moderation | UnsafeBench Sexual Text+Visual | Accuracy | 54.86 | 3 |
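For readers unfamiliar with the metrics in the table, the two most common ones here are accuracy and F1. Below is a minimal, self-contained sketch of how they are typically computed for a binary moderation label (safe vs. unsafe); the toy labels are illustrative only and are not the benchmark data.

```python
# Illustrative computation of accuracy and binary F1, the metrics most of
# the benchmark rows above report. Toy labels only -- not the paper's data.

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the ground-truth labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def f1(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the positive class."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Toy example: 1 = unsafe, 0 = safe
y_true = [1, 1, 0, 0, 1]
y_pred = [1, 0, 0, 1, 1]
print(accuracy(y_true, y_pred))  # 0.6
print(f1(y_true, y_pred))        # 2/3: precision = recall = 2/3
```

Note that a low F1 on a hard task (such as the 8.9 F1 Tag score above) and a mid-range accuracy are not directly comparable, since F1 ignores true negatives while accuracy counts them.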
