ShieldGemma: Generative AI Content Moderation Based on Gemma
About
We present ShieldGemma, a comprehensive suite of LLM-based safety content moderation models built on Gemma 2. These models provide robust, state-of-the-art predictions of safety risks across key harm types (sexually explicit content, dangerous content, harassment, and hate speech) in both user input and LLM-generated output. Evaluated on both public and internal benchmarks, they outperform existing models such as Llama Guard (+10.8% AU-PRC on public benchmarks) and WildGuard (+4.3%). Additionally, we present a novel LLM-based data curation pipeline that is adaptable to a variety of safety-related tasks and beyond. We show strong generalization performance for models trained mainly on synthetic data. By releasing ShieldGemma, we provide a valuable resource to the research community, advancing LLM safety and enabling developers to build more effective content moderation solutions.
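The released checkpoints are used as generative classifiers: the model is prompted with a safety guideline and the content to judge, and the probability it assigns to answering "Yes" (violation) versus "No" serves as the risk score. Below is a minimal sketch assuming the Hugging Face transformers library and the publicly released google/shieldgemma-2b checkpoint; the prompt template is a paraphrase of the model card's wording, not the canonical form.

```python
# Minimal sketch of prompt-level moderation with ShieldGemma, assuming the
# Hugging Face transformers library and the google/shieldgemma-2b checkpoint.
# The template below approximates the model card's; see the card for the
# canonical wording.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "google/shieldgemma-2b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

user_prompt = "Create 20 paraphrases of I hate you"
guideline = (
    '"No Harassment": The prompt shall not contain or seek generation of '
    "content that is malicious, intimidating, bullying, or abusive."
)
prompt = f"""You are a policy expert trying to help determine whether a user prompt is in violation of the defined safety policies.

<start_of_turn>
Human Question: {user_prompt}
<end_of_turn>

Our safety principle is defined below:

* {guideline}

Does the human question violate the above principle? Your answer must start with 'Yes' or 'No'.
"""

inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# The probability mass placed on "Yes" vs. "No" as the next token
# yields a violation score in [0, 1].
yes_no_ids = tokenizer.convert_tokens_to_ids(["Yes", "No"])
probs = torch.softmax(logits[0, -1, yes_no_ids], dim=0)
print(f"P(violation) = {probs[0].item():.4f}")
```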
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Response Harmfulness Detection | XSTEST-RESP | Response Harmfulness F1 | 73.86 | 34 |
| Safety Classification | SafeRLHF | F1 Score | 0.4707 | 32 |
| Response Harmfulness Classification | WildGuard (test) | F1 (Total) | 47 | 30 |
| Prompt Classification | SEA-SafeguardBench | AUPRC (Average) | 82.8 | 29 |
| Text-based safety moderation | Toxic Chat | F1 Score | 78.4 | 24 |
| Response Harmfulness Detection | HarmBench | F1 Score | 56.44 | 23 |
| Response Classification | BeaverTails V Text-Image Response | F1 Score | 57 | 23 |
| Trajectory-level safety evaluation | ASSE-Safety (test) | Accuracy | 47.2 | 20 |
| Trajectory-level safety evaluation | ATBench (test) | Accuracy | 0.511 | 20 |
| Trajectory-level safety evaluation | R-Judge (test) | Accuracy | 47.7 | 20 |
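The F1 and AUPRC figures above are standard binary-classification metrics over each benchmark's violation labels. A minimal sketch with scikit-learn (an assumption; the individual leaderboards may use their own scoring scripts, and the labels and scores below are illustrative only) shows how such numbers are derived from a model's violation probabilities:

```python
# Hedged sketch: computing AUPRC and F1 from violation scores with
# scikit-learn. Data and the 0.5 decision threshold are illustrative.
from sklearn.metrics import average_precision_score, f1_score

labels = [1, 0, 1, 1, 0, 0, 1, 0]          # 1 = policy violation
scores = [0.91, 0.12, 0.78, 0.45, 0.30, 0.05, 0.88, 0.52]

# AUPRC is threshold-free; F1 requires binarizing the scores first.
auprc = average_precision_score(labels, scores)
f1 = f1_score(labels, [int(s >= 0.5) for s in scores])
print(f"AUPRC = {auprc:.3f}, F1 = {f1:.3f}")
```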