WildGuard

Benchmarks

| Task Name | Dataset Name | SOTA Result | Trend |
| --- | --- | --- | --- |
| Harmfulness Detection | WildGuard | Macro F1 Score 90.6 | 47 |
| Response Harmfulness Classification | WildGuard (test) | F1 (Total) 79.48 | 30 |
| Safety Evaluation | Wildguard (test) | Wildguard Test Score 0.08 | 27 |
| Value Alignment | WildGuard (test) | Score 29.69 | 24 |
| Input Moderation | WildGuard (test) | F1 Score 89.5 | 22 |
| Prompt Harmfulness Classification | WILDGUARD (test) | F1 Score 89.44 | 18 |
| Safety classification | WildGuard (test) | F1 Score 88.5 | 17 |
| Safety Evaluation | WildGuard | WildGuard Refusal Rate 25.5 | 16 |
| Prompt Classification | WildGuard Text Prompt | F1 Score 90.46 | 14 |
| Refusal Detection | WILDGUARD (test) | F1 (Harmful) 94 | 14 |
| Text-based safety moderation | WildGuard | F1 Score 78.6 | 12 |
| Binary safety classification | wildguard prompt safety | Macro F1 97.91 | 11 |
| Output Moderation | WildGuard (test) | F1 Score 79.5 | 11 |
| Stealthiness Evaluation | WildGuard 7B | Mean Perplexity 2.33 | 10 |
| Streaming Safety Detection | WildGuard (test) | Det@183.45 | 8 |
| Jailbreak Attack | WildGuard (test) | ASR 82.64 | 8 |
| Violation Detection | WildGuard (test) | Safety F1 89.17 | 7 |
| Refusal Evaluation | WildGuard Harmful | Refusal Rate 84.35 | 7 |
| Over-refusal Evaluation | WildGuard Unharmful | Over-refusal Rate 1.06 | 7 |
| Safety Moderation | WildGuard Prompt | F1 Score 89.5 | 7 |
| Audio Safety Moderation | WildGuard-TTS | F1 Score 88.4 | 7 |
| Multi-label Safety Categorization | wildguard prompt subcategory | Macro Accuracy 83.35 | 4 |
| SCAV-Embedding Attack Defense | Wildguard (test) | ASR 28.84 | 4 |
| Safety Classification | WildGuard overall (test) | Accuracy 85.6 | 2 |
| Jailbreaking | WildGuard 7B | Bypass Rate 99.81 | 1 |