
BeaverTails

Benchmarks

| Task Name | Dataset Name | SOTA Result | Trend |
| --- | --- | --- | --- |
| Safety Evaluation | BeaverTails (test) | Harmful Score: 7.9 | 110 |
| Harmfulness Detection | BeaverTails bottom 30% uncertainty slice (test) | AUROC: 85.1 | 70 |
| Safe-or-harmful binary classification | BeaverTails | Accuracy: 84.6 | 63 |
| Multi-risk safety monitoring | BeaverTails | Accuracy (%): 80.1 | 63 |
| Harmful question-answering | BeaverTails HarmfulQA (1k and 10k samples) | Avg Harmfulness Score: 0 | 63 |
| Response Harmfulness Detection | BeaverTails | F1 Score: 89.9 | 59 |
| Harmful score evaluation | BeaverTails (test) | Harmful Score: 28.4 | 52 |
| Text-based safety moderation | BeaverTails | F1 Score: 87.3 | 46 |
| Malicious Fine-tuning Defense | BeaverTails (test) | Harmfulness Score: 1 | 44 |
| Response Classification | BeaverTails V Text-Image Response | F1 Score: 84.8 | 39 |
| Multi-label content safety classification | Beavertails | F1 Score: 0.86 | 35 |
| Harmlessness evaluation | Beavertails | Helpful Score: 58.4 | 33 |
| Safety | Beavertails | Violation Rate: 1.2 | 32 |
| Adversarial Attack Robustness | BeaverTails | Attack Success Rate: 0 | 24 |
| Value Alignment | BeaverTails (test) | Value Alignment Score: 59.9 | 24 |
| Safety classification | BeaverTails (test) | AUC: 94 | 24 |
| Multimodal Safety Evaluation | BeaverTails-V | Safety Score: 2.9 | 22 |
| Safety Evaluation | Beavertails-V (test) | Helpfulness Score: 86.93 | 20 |
| Safety Evaluation | BeaverTails Evaluation | Harmful Score (HS): 1.24 | 20 |
| Adversarial and Jailbreaking Attack Detection | BeaverTails | AUROC: 0.8525 | 20 |
| Safety Evaluation | BeaverTails | ASR: 8.8 | 19 |
| LLM Safety Evaluation | BeaverTails | Ssafe Score: 9.88 | 18 |
| Safety and Helpfulness Evaluation | Beavertails | Safety Score: 92.03 | 18 |
| Safety Evaluation | BeaverTails Text | Overall Score: 99.3 | 16 |
| Safety Evaluation | BeaverTails (test) | Blue Reward (Higher): 98 | 15 |
Showing 25 of 55 rows