
BeaverTails

Benchmarks

| Task Name | Dataset Name | SOTA Result | Trend |
| --- | --- | --- | --- |
| Harmful question-answering | BeaverTails HarmfulQA (1k and 10k samples) | Avg Harmfulness Score: 0 | 63 |
| Malicious Fine-tuning Defense | BeaverTails (test) | Harmfulness Score: 1 | 44 |
| Harmful score evaluation | BeaverTails (test) | Harmful Score: 28.4 | 36 |
| Multi-label content safety classification | Beavertails | F1 Score: 0.86 | 35 |
| Harmlessness evaluation | Beavertails | Helpful Score: 58.4 | 33 |
| Safety | Beavertails | Violation Rate: 1.2 | 32 |
| Response Classification | BeaverTails V Text-Image Response | F1 Score: 84.02 | 23 |
| Adversarial and Jailbreaking Attack Detection | BeaverTails | AUROC: 0.8525 | 20 |
| Text-based safety moderation | BeaverTails | F1 Score: 87.3 | 19 |
| Response Harmfulness Detection | BeaverTails | F1 Score: 89.9 | 18 |
| Safety Evaluation | BeaverTails Text | Overall Score: 99.3 | 16 |
| Safety Evaluation | BeaverTails Audio 1K | RSR: 98.95 | 12 |
| Prompt Classification | BeaverTails-V Text-Image Prompt | F1 Score: 88.36 | 7 |
| Unsafe content categorization | BeaverTails V | Accuracy: 63.44 | 6 |
| LLM Personalization | BeaverTails (test) | Personalization Win Rate: 72.1 | 6 |
| Out-of-Taxonomy Risk Detection | BeaverTails V | F1 Score: 0.6314 | 4 |
| Safety Evaluation | Beavertails-A Audio | Safety Score: 0.997 | 4 |
| LLM Safety Defense | BeaverTails | ASR: 10.39 | 4 |
| Safety Reasoning Evaluation | BeaverTails 5,000 prompts (subsampled) | Relevance: 4.68 | 2 |
| OOD safety category inference (Stage 2) | BeaverTails V | Reward Mean: - | 0 |