
Safety Evaluation Set

Benchmarks

| Task Name | Dataset Name | SOTA Result | Trend |
| --- | --- | --- | --- |
| LLM Safety | Safety Evaluation Set | Harmful Response Rate: 1.66 | 25 |
| Content Moderation | Safety Evaluation Set Moderation (held-out target labels) | AUROC: 0.89 | 6 |
| Sentiment Analysis | Safety Evaluation Set Sentiment (held-out target labels) | AUROC: 97.5 | 6 |
| Jailbreaking Detection | Safety Evaluation Set Jailbreaking (held-out target labels) | AUROC: 97.4 | 6 |
| Toxicity Detection | Safety Evaluation Set Toxicity (held-out target labels) | AUROC: 97.6 | 6 |