
BeaverTails

Benchmarks

| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Harmful question-answering | BeaverTails HarmfulQA (1k and 10k samples) | Avg Harmfulness Score: 0 | 63 |
| Text-based safety moderation | BeaverTails | F1 Score: 87.3 | 46 |
| Malicious Fine-tuning Defense | BeaverTails (test) | Harmfulness Score: 1 | 44 |
| Response Classification | BeaverTails V Text-Image Response | F1 Score: 84.8 | 39 |
| Harmful score evaluation | BeaverTails (test) | Harmful Score: 28.4 | 36 |
| Multi-label content safety classification | Beavertails | F1 Score: 0.86 | 35 |
| Harmlessness evaluation | Beavertails | Helpful Score: 58.4 | 33 |
| Safety | Beavertails | Violation Rate: 1.2 | 32 |
| Multimodal Safety Evaluation | BeaverTails-V | Safety Score: 2.9 | 22 |
| Safety Evaluation | Beavertails-V (test) | Helpfulness Score: 86.93 | 20 |
| Safety Evaluation | BeaverTails Evaluation | Harmful Score (HS): 1.24 | 20 |
| Adversarial and Jailbreaking Attack Detection | BeaverTails | AUROC: 0.8525 | 20 |
| Response Harmfulness Detection | BeaverTails | F1 Score: 89.9 | 18 |
| Safety Evaluation | BeaverTails Text | Overall Score: 99.3 | 16 |
| Safety Evaluation | BeaverTails (test) | Blue Reward (Higher): 98 | 15 |
| Safety Alignment | BeaverTails V | Safety Score: 93.37 | 13 |
| Safety Evaluation | BeaverTails Audio 1K | RSR: 98.95 | 12 |
| Tracing | Beavertails (test) | Tracing Success Rate (TSR): 99.29 | 10 |
| Prompt Classification | BeaverTails-V Text-Image Prompt | F1 Score: 88.36 | 7 |
| Unsafe content categorization | BeaverTails V | Accuracy: 63.44 | 6 |
| LLM Personalization | BeaverTails (test) | Personalization Win Rate: 72.1 | 6 |
| Safety Monitoring | BeaverTails | ASR @ 1% FPR: 19.6 | 5 |
| Safety Evaluation | BeaverTails | Safety Score: 76.7 | 4 |
| Out-of-Taxonomy Risk Detection | BeaverTails V | F1 Score: 0.6314 | 4 |
| Safety Evaluation | Beavertails-A Audio | Safety Score: 0.997 | 4 |
Showing 25 of 28 rows