
Aegis

Benchmarks

| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Response Classification | Aegis Text Response 2.0 | F1 Score: 86.3 | 32 |
| Prompt classification | Aegis 2.0 | F1 Score: 87.3 | 32 |
| Prompt classification | Aegis | F1 Score: 89.6 | 32 |
| Explainability classification | AEGIS 2.0 (test) | Unsafe F1: 79.6 | 27 |
| Input Moderation | AEGIS (test) | F1 Score: 91.4 | 26 |
| Harmfulness Detection | Aegis | Macro F1: 89.78 | 25 |
| Safety classification | AEGIS 2.0 (test) | AUC: 94 | 24 |
| Input Moderation | AEGIS 2.0 (test) | F1 Score: 87.9 | 22 |
| Content Moderation | Aegis 2.0 | F1 Score: 86.1 | 21 |
| Safety Moderation | Aegis 2.0 | Prompt F1: 86.4 | 15 |
| Prompt Classification | Aegis Text Prompt 2.0 | F1 Score: 83.52 | 14 |
| Text-based safety moderation | Aegis | F1 Score: 84 | 12 |
| Self-Harm Risk Screening | AEGIS (N=161) 2.0 | Escalation Rate (Esc.): 100 | 10 |
| Unsafe content categorization | Aegis 2.0 | Accuracy: 63.18 | 9 |
| Harmfulness Detection | Aegis 2.0 | Macro F1: 83.4 | 8 |
| Safety Moderation | Aegis Response 2.0 | F1 Score: 87.6 | 7 |
| Safety Moderation | Aegis Prompt 2.0 | F1 Score: 87.9 | 7 |
| Safety Moderation | Aegis Prompt 1.0 | F1 Score: 91.9 | 7 |
| Multi-label Safety Categorization | aegis categories | Macro Accuracy: 62.84 | 4 |
| Out-of-Taxonomy Risk Detection | Aegis 2.0 | F1 Score: 63.62 | 4 |
| OOD safety category inference (Stage 2) | Aegis 2.0 | Reward Mean: 23.08 | 4 |
| Content Moderation | Aegis In-Distribution | Pornography Score: 75 | 2 |