
Aegis

Benchmarks

| Task Name | Dataset Name | Metric | SOTA Result | Trend |
|---|---|---|---|---|
| Explainability classification | AEGIS 2.0 (test) | Unsafe F1 | 79.6 | 27 |
| Response Classification | Aegis Text Response 2.0 | F1 Score | 82.27 | 16 |
| Prompt classification | Aegis 2.0 | F1 Score | 87.3 | 16 |
| Prompt classification | Aegis | F1 Score | 89.6 | 16 |
| Prompt Classification | Aegis Text Prompt 2.0 | F1 Score | 83.52 | 14 |
| Text-based safety moderation | Aegis | F1 Score | 84 | 12 |
| Content Moderation | Aegis 2.0 | F1 Score | 80 | 10 |
| Unsafe content categorization | Aegis 2.0 | Accuracy | 63.18 | 9 |
| Safety Moderation | Aegis Response 2.0 | F1 Score | 87.6 | 7 |
| Safety Moderation | Aegis Prompt 2.0 | F1 Score | 87.9 | 7 |
| Safety Moderation | Aegis Prompt 1.0 | F1 Score | 91.9 | 7 |
| Out-of-Taxonomy Risk Detection | Aegis 2.0 | F1 Score | 63.62 | 4 |
| OOD safety category inference (Stage 2) | Aegis 2.0 | Reward Mean | 23.08 | 4 |