
Wildjailbreak

Benchmarks

| Task Name | Dataset Name | SOTA Result | Trend |
| --- | --- | --- | --- |
| Safety Evaluation | WildJailbreak (held-out) | Attack Success Rate (ASR): 0 | 50 |
| Safety Alignment | WildJailbreak | Trainable parameters (M): 15,768.31 | 44 |
| Over-refusal | Wildjailbreak (Benign) | Wildjailbreak Benign Refusal Rate: 28.4 | 42 |
| Safety | WildJailbreak | Harmful Response Ratio: 17.6 | 21 |
| Safety Evaluation | WildJailbreak | Helpfulness Score (H): 0.8592 | 20 |
| Safety Moderation | WILDJAILBREAK (val) | ASR: 0.7 | 18 |
| Jailbreak Detection | Wildjailbreak | F1 Score: 96 | 15 |
| Safety Performance | WildJailbreak | Selective Refusal Score (Δs): 90.1 | 11 |