Wildjailbreak

Benchmarks

| Task Name | Dataset Name | SOTA Result | Trend |
| --- | --- | --- | --- |
| Jailbreak Robustness | WildJailbreak | Unsafe Rate: 0 | 144 |
| Safety Evaluation | WildJailbreak | ASR: 0.068 | 70 |
| Safety Evaluation | WildJailbreak (held-out) | Attack Success Rate (ASR): 0 | 50 |
| Over-refusal | Wildjailbreak (Benign) | Wildjailbreak Benign Refusal Rate: 1.43 | 49 |
| Safety Alignment | WildJailbreak | Trainable parameters (M): 15,768.31 | 44 |
| Jailbreak | WildJailBreak (WJB) (test) | ASR@1: 2 | 33 |
| Safety Alignment | WildJailbreak | Safe@1: 77.2 | 24 |
| Jailbreak Evaluation | WildJailbreak | Performance Rate: 98.5 | 22 |
| Safety | WildJailbreak | Harmful Response Ratio: 17.6 | 21 |
| Safety Moderation | WILDJAILBREAK (val) | ASR: 0.7 | 18 |
| Jailbreak Safety | WildJailbreak | Reasoning Harmful Ratio: 17.3 | 17 |
| Jailbreak Detection | Wildjailbreak | F1 Score: 96 | 15 |
| Safety Alignment Evaluation | WildJailbreak (WildJB) | Safety Rate: 98.6 | 14 |
| Text Moderation | WildJailbreak Adv. Benign n = 210 | Flagged Count: 3 | 13 |
| Text Moderation | WildJailbreak Adv. Harmful n = 2,000 | Flagged Count: 278 | 13 |
| Adversarial Robustness | WildJailbreak 2k queries (test) | Number of Explicit Refusals: 498 | 12 |
| Attacking Cascade System Decision Maker | WildJailbreak | Performance: 98.5 | 11 |
| Jailbreaking | WildJailbreak (WJB) | ASR@1 (Qwen2.5-7B-IT): 89.5 | 11 |
| Jailbreak Attack | WildJailbreak | ASR: 0.8 | 11 |
| Safety Performance | WildJailbreak | Selective Refusal Score (Δs): 90.1 | 11 |
| Benign Compliance | WildJailBreak (WJB) Vanilla Benign | Compliance Rate: 100 | 9 |
| Refusal Evaluation | WildJailbreak Adversarial Harmful | Refusal Rate: 89.45 | 7 |
| Jailbreaking Attack Detection | WildJailbreak | Accuracy (MCA): 36 | 6 |
| Safety Detection | WildJailbreak (held-out) | AUROC: 99 | 5 |
| Jailbreak | WildJailbreak Forbidden Questions (Overall) | ASR: 92.1 | 2 |