WildGuard

Benchmarks

| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Response Harmfulness Classification | WildGuard (test) | F1 (Total) 79.48 | 30 |
| Safety Evaluation | Wildguard (test) | Wildguard Test Score 0.08 | 27 |
| Input Moderation | WildGuard (test) | F1 Score 89.5 | 22 |
| Prompt Classification | WildGuard Text Prompt | F1 Score 90.46 | 14 |
| Refusal Detection | WILDGUARD (test) | F1 (Harmful) 94 | 14 |
| Text-based safety moderation | WildGuard | F1 Score 78.6 | 12 |
| Prompt Harmfulness Classification | WILDGUARD (test) | F1 (Total) 88.9 | 12 |
| Output Moderation | WildGuard (test) | F1 Score 79.5 | 11 |
| Jailbreak Attack | WildGuard (test) | ASR 82.64 | 8 |
| Refusal Evaluation | WildGuard Harmful | Refusal Rate 84.35 | 7 |
| Over-refusal Evaluation | WildGuard Unharmful | Over-refusal Rate 1.06 | 7 |
| Safety Moderation | WildGuard Prompt | F1 Score 89.5 | 7 |
| Audio Safety Moderation | WildGuard-TTS | F1 Score 88.4 | 7 |
| SCAV-Embedding Attack Defense | Wildguard (test) | ASR 28.84 | 4 |