
ToxicChat

Benchmarks

| Task Name | Dataset Name | Metric | SOTA Result | Trend |
|---|---|---|---|---|
| Performance Estimation | ToxicChat | MAE | 0 | 198 |
| Toxicity Detection | ToxicChat | F1 Score | 1 | 45 |
| Safety Classification | ToxicChat (test) | Accuracy | 97.3 | 43 |
| Input Moderation | ToxicChat (test) | F1 Score | 82.8 | 42 |
| Unsafe Prompt Detection | ToxicChat (test) | Precision | 0.815 | 16 |
| Safety Refusal | ToxicChat | Refusal Rate | 95 | 15 |
| Prompt Classification | ToxicChat Text Prompt | F1 Score | 96.27 | 14 |
| Safety Classification | ToxicChat | F1 Score | 0.81 | 14 |
| Binary safety classification | ToxicChat jailbreaking | Macro F1 | 70.54 | 11 |
| Safety Classification | ToxicChat (out-of-distribution) | F1 Score | 72.88 | 11 |
| Prompt-only Safety Routing | ToxicChat | Routing F1 | 56.82 | 10 |
| Content Safety Classification | ToxicChat | Precision | 75.46 | 6 |
| Safety Detection | ToxicChat (held-out) | AUROC | 87.7 | 5 |
| OOD Detection | ToxicChat (test) | Length-Matched AUROC | 60.2 | 5 |
| Toxicity Detection | ToxicChat (test) | Accuracy | 0.9772 | 4 |
| Unsafe prompt detection | ToxicChat | AUPRC | 75.5 | 4 |
| Safety Detection | ToxicChat | F1 Score | 65 | 3 |
| Calibration Analysis | ToxicChat | AUROC | 0.67 | 2 |
| Safety Classification | ToxicChat (in-distribution) | F1 Score (%) | 82.2 | 2 |
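The listed results mix two scales: some are fractions in [0, 1] (for example Precision 0.815 or Accuracy 0.9772) and others are percentages (for example Precision 75.46 or F1 Score 82.8). The listing does not say which convention each entry used, so compare rows with care. As a minimal sketch, assuming binary labels with "toxic" treated as the positive class, the standard precision, recall, F1, and accuracy definitions below return fractions; multiplying by 100 gives the percentage form. The `Confusion` class, the helper names, and the counts in the usage example are hypothetical and only illustrate the formulas.

```python
from dataclasses import dataclass


@dataclass
class Confusion:
    """Binary confusion-matrix counts (hypothetical; toxic = positive class)."""
    tp: int  # toxic prompts predicted toxic
    fp: int  # non-toxic prompts predicted toxic
    fn: int  # toxic prompts predicted non-toxic
    tn: int  # non-toxic prompts predicted non-toxic


def precision(c: Confusion) -> float:
    # Fraction of predicted-toxic prompts that are actually toxic.
    denom = c.tp + c.fp
    return c.tp / denom if denom else 0.0


def recall(c: Confusion) -> float:
    # Fraction of actually toxic prompts that were flagged.
    denom = c.tp + c.fn
    return c.tp / denom if denom else 0.0


def f1(c: Confusion) -> float:
    # Harmonic mean of precision and recall.
    p, r = precision(c), recall(c)
    return 2 * p * r / (p + r) if (p + r) else 0.0


def accuracy(c: Confusion) -> float:
    # Fraction of all prompts classified correctly.
    total = c.tp + c.fp + c.fn + c.tn
    return (c.tp + c.tn) / total if total else 0.0


def as_percent(x: float) -> float:
    # Convert a [0, 1] score to the 0-100 scale some rows use.
    return 100.0 * x


if __name__ == "__main__":
    # Hypothetical counts, not taken from any ToxicChat result.
    c = Confusion(tp=8, fp=2, fn=2, tn=88)
    for name, fn_ in [("precision", precision), ("recall", recall),
                      ("f1", f1), ("accuracy", accuracy)]:
        score = fn_(c)
        print(f"{name}: {score:.4f} ({as_percent(score):.2f}%)")
```

Because ToxicChat is imbalanced toward non-toxic prompts in typical splits, accuracy can look high even when F1 on the toxic class is modest, which is one reason F1, precision, and AUPRC appear alongside accuracy in the table.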