ShieldGemma 2: Robust and Tractable Image Content Moderation

About

We introduce ShieldGemma 2, a 4B-parameter image content moderation model built on Gemma 3. The model provides robust safety risk predictions across the following key harm categories: Sexually Explicit, Violence & Gore, and Dangerous Content, for both synthetic images (e.g., the output of any image generation model) and natural images (e.g., any image input to a vision-language model). We evaluated it on both internal and external benchmarks, demonstrating state-of-the-art performance relative to LlavaGuard \citep{helff2024llavaguard}, GPT-4o mini \citep{hurst2024gpt}, and the base Gemma 3 model \citep{gemma_2025} under our policies. Additionally, we present a novel adversarial data generation pipeline that enables controlled, diverse, and robust image generation. ShieldGemma 2 provides an open image moderation tool to advance multimodal safety and responsible AI development.

Wenjun Zeng, Dana Kurniawan, Ryan Mullins, Yuchi Liu, Tamoghna Saha, Dirichi Ike-Njoku, Jindong Gu, Yiwen Song, Cai Xu, Jingjing Zhou, Aparna Joshi, Shravan Dheep, Mani Malek, Hamid Palangi, Joon Baek, Rick Pereira, Karthik Narasimhan• 2025

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Tag Detection | SenBen MECD tags 1.0 (test) | F1 Tag | 8.9 | 11 |
| Content Moderation | UnsafeBench Sexual category (test) | Accuracy | 64.8 | 8 |
| Jailbreak Defense | Safety Guardrail Evaluation Set | Char Noise Robustness | 24 | 6 |
| Multimodal Content Moderation | UnsafeBench Sexual Text-Only | Accuracy | 59.09 | 3 |
| Multimodal Content Moderation | UnsafeBench Sexual Text+Visual | Accuracy | 54.86 | 3 |
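For readers unfamiliar with the metrics in the table, the two most common ones here are accuracy and F1. Below is a minimal, self-contained sketch of how they are typically computed for a binary moderation label (safe vs. unsafe); the toy labels are illustrative only and are not the benchmark data.

```python
# Illustrative computation of accuracy and binary F1, the metrics most of
# the benchmark rows above report. Toy labels only -- not the paper's data.

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the ground-truth labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def f1(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the positive class."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Toy example: 1 = unsafe, 0 = safe
y_true = [1, 1, 0, 0, 1]
y_pred = [1, 0, 0, 1, 1]
print(accuracy(y_true, y_pred))  # 0.6
print(f1(y_true, y_pred))        # 2/3: precision = recall = 2/3
```

Note that a low F1 on a hard task (such as the 8.9 F1 Tag score above) and a mid-range accuracy are not directly comparable, since F1 ignores true negatives while accuracy counts them.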
