Aegis

Benchmarks

| Task Name | Dataset Name | SOTA Result | Trend |
| --- | --- | --- | --- |
| Response Classification | Aegis Text Response 2.0 | F1 Score 86.3 | 32 |
| Prompt classification | Aegis 2.0 | F1 Score 87.3 | 32 |
| Prompt classification | Aegis | F1 Score 89.6 | 32 |
| Explainability classification | AEGIS 2.0 (test) | Unsafe F1 79.6 | 27 |
| Input Moderation | AEGIS (test) | F1 Score 91.4 | 26 |
| Input Moderation | AEGIS 2.0 (test) | F1 Score 87.9 | 22 |
| Content Moderation | Aegis 2.0 | F1 Score 86.1 | 21 |
| Prompt Classification | Aegis Text Prompt 2.0 | F1 Score 83.52 | 14 |
| Text-based safety moderation | Aegis | F1 Score 84 | 12 |
| Unsafe content categorization | Aegis 2.0 | Accuracy 63.18 | 9 |
| Safety Moderation | Aegis Response 2.0 | F1 Score 87.6 | 7 |
| Safety Moderation | Aegis Prompt 2.0 | F1 Score 87.9 | 7 |
| Safety Moderation | Aegis Prompt 1.0 | F1 Score 91.9 | 7 |
| Out-of-Taxonomy Risk Detection | Aegis 2.0 | F1 Score 63.62 | 4 |
| OOD safety category inference (Stage 2) | Aegis 2.0 | Reward Mean 23.08 | 4 |
| Content Moderation | Aegis In-Distribution | Pornography Score 75 | 2 |