
Granite Guardian

About

We introduce the Granite Guardian models, a suite of safeguards designed to provide risk detection for prompts and responses, enabling safe and responsible use in combination with any large language model (LLM). These models offer comprehensive coverage across multiple risk dimensions, including social bias, profanity, violence, sexual content, unethical behavior, jailbreaking, and hallucination-related risks such as context relevance, groundedness, and answer relevance for retrieval-augmented generation (RAG). Trained on a unique dataset combining human annotations from diverse sources and synthetic data, Granite Guardian models address risks typically overlooked by traditional risk detection models, such as jailbreaks and RAG-specific issues. With AUC scores of 0.871 and 0.854 on harmful content and RAG-hallucination-related benchmarks respectively, Granite Guardian is the most generalizable and competitive model available in the space. Released as open-source, Granite Guardian aims to promote responsible AI development across the community. https://github.com/ibm-granite/granite-guardian
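The abstract summarizes performance with AUC (area under the ROC curve), which measures how well the model's risk scores rank harmful examples above benign ones. As a minimal sketch of what that number means (plain Python, pairwise rank formulation; the function name is illustrative, not part of the Granite Guardian codebase):

```python
def auc_score(labels, scores):
    """AUC via the pairwise-comparison (Mann-Whitney U) formulation.

    labels: iterable of 0/1 ground-truth risk labels (1 = risky)
    scores: iterable of model risk scores (higher = riskier)
    Returns the probability that a randomly chosen risky example
    receives a higher score than a randomly chosen benign one
    (ties count as half).
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need both risky and benign examples")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Example: two benign (0) and two risky (1) prompts with model risk scores.
print(auc_score([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```

An AUC of 0.871, as reported for the harmful-content benchmarks, therefore means the detector ranks a risky example above a benign one about 87% of the time.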

Inkit Padhi, Manish Nagireddy, Giandomenico Cornacchia, Subhajit Chaudhury, Tejaswini Pedapati, Pierre Dognin, Keerthiram Murugesan, Erik Miehling, Martín Santillán Cooper, Kieran Fraser, Giulio Zizzo, Muhammad Zaid Hameed, Mark Purcell, Michael Desmond, Qian Pan, Zahra Ashktorab, Inge Vejsbjerg, Elizabeth M. Daly, Michael Hind, Werner Geyer, Ambrish Rawat, Kush R. Varshney, Prasanna Sattigeri • 2024

Related benchmarks

Task                                      | Dataset                                | Metric    | Result | Rank
------------------------------------------|----------------------------------------|-----------|--------|-----
Multi-label content safety classification | BeaverTails                            | F1 Score  | 0.84   | 35
Safety Classification                     | SafeRLHF                               | F1 Score  | 0.92   | 32
Safety Risk Detection                     | internal Agentic AI workflow benchmark | Precision | 98     | 29
Safety Classification                     | WildGuardMix (test)                    | --        | --     | 27
Text-based safety moderation              | OpenAI                                 | F1 Score  | 77     | 26
Safety Classification                     | XSTest (test)                          | F1        | 85.7   | 20
Safety Classification                     | XSTest                                 | F1 Score  | 87     | 16
Unsafe Prompt Detection                   | ToxicChat (test)                       | Precision | 0.423  | 16
Prompt injection detection                | SafeGuardPI                            | F1 Score  | 93     | 15
Adversarial Attack Detection              | InTheWild                              | Recall    | 87     | 15

(Showing 10 of 30 rows.)
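Most rows above report F1 Score, which balances the precision and recall also listed for some datasets. A minimal sketch of how these three metrics relate, computed from confusion-matrix counts (the function name is illustrative):

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from confusion-matrix counts.

    tp: true positives (risky content correctly flagged)
    fp: false positives (benign content incorrectly flagged)
    fn: false negatives (risky content missed)
    F1 is the harmonic mean of precision and recall.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return precision, recall, f1

# Example: a detector that flags 8 of 10 risky items, with 2 false alarms.
p, r, f1 = precision_recall_f1(tp=8, fp=2, fn=2)
print(p, r, f1)  # 0.8 0.8 0.8
```

Note that the table mixes scales (0.84 vs. 77 vs. 98); the values are reproduced as listed, with some entries on a 0–1 scale and others expressed as percentages.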
