FORTRESS

Benchmarks

| Task Name | Dataset Name | SOTA Result | Trend |
| --- | --- | --- | --- |
| Jailbreaking Safety Evaluation | Fortress | Safety Score: 87.7 | 30 |
| Jailbreak Attack Defense | FORTRESS | ASR: 9.8 | 24 |
| Harmful Content Detection | Fortress | ASR: 18.6 | 12 |
| Safety Evaluation | Fortress | JailBreak Score: 2.8 | 12 |
| Overrefusal Evaluation | Fortress OR | Helpfulness Score: 97.6 | 12 |
| Non-Agentic Performance Evaluation | Fortress (test) | Mean Score: 78.75 | 4 |
| Safety Evaluation | Fortress | Cost per Accuracy Point ($): 0.0016 | 4 |