
WildJailbreak

Benchmarks

| Task Name | Dataset Name | SOTA Result | Trend |
| --- | --- | --- | --- |
| Safety Evaluation | WildJailbreak | ASR: 0.101 | 53 |
| Safety Evaluation | WildJailbreak (held-out) | Attack Success Rate (ASR): 0 | 50 |
| Over-refusal | Wildjailbreak (Benign) | Wildjailbreak Benign Refusal Rate: 1.43 | 49 |
| Safety Alignment | WildJailbreak | Trainable parameters (M): 15,768.31 | 44 |
| Safety Alignment | WildJailbreak | Safe@1: 77.2 | 24 |
| Safety | WildJailbreak | Harmful Response Ratio: 17.6 | 21 |
| Safety Moderation | WILDJAILBREAK (val) | ASR: 0.7 | 18 |
| Jailbreak Safety | WildJailbreak | Reasoning Harmful Ratio: 17.3 | 17 |
| Jailbreak Detection | Wildjailbreak | F1 Score: 96 | 15 |
| Safety Performance | WildJailbreak | Selective Refusal Score (Δs): 90.1 | 11 |
| Refusal Evaluation | WildJailbreak Adversarial Harmful | Refusal Rate: 89.45 | 7 |
| Jailbreak | WildJailbreak Forbidden Questions (Overall) | ASR: 92.1 | 2 |