OpenAI Moderation

Benchmarks

| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Harmfulness Detection | OpenAI Moderation | Macro F1 Score: 92.9 | 45 |
| Safety Classification | OpenAI-moderation (test) | Accuracy: 74.88 | 23 |
| Prompt Classification | OpenAI Moderation Text Prompt | F1 Score: 88.89 | 14 |
| Unsafe content categorization | OpenAI Moderation | Accuracy: 88.35 | 9 |
| Multi-label Safety Categorization | OpenAI Moderation | Macro Accuracy: 47.67 | 4 |
| Out-of-Taxonomy Risk Detection | OpenAI Moderation | Out-of-Taxonomy F1: 67.92 | 4 |
| OOD safety category inference (Stage 2) | OpenAI Moderation | Mean Reward: 36.45 | 4 |
| Jailbreaking | OpenAI’s Moderation | Bypass Rate: 100 | 1 |