JailbreakBench

Benchmarks

| Task Name | Dataset Name | SOTA Result | Trend |
| --- | --- | --- | --- |
| Jailbreak Attack | JailbreakBench | ASR: 100 | 242 |
| Jailbreak Attack | JailbreakBench | ASR@100 | 132 |
| Jailbreak Robustness | JailbreakBench | Harmbench ASR: 0 | 72 |
| Unsafe Robustness | JailbreakBench | Unsafe Rate: 0 | 72 |
| Jailbreak Attack | JailbreakBench (JBB) | ASR: 0 | 62 |
| Safety Evaluation | JailbreakBench (JBB) (test) | ASR (Llama-Guard-3-8B): 1.12 | 56 |
| Jailbreaking | JailbreakBench | Attack Success Rate (ASR): 2 | 53 |
| Jailbreak Attack | JailbreakBench (JBB) (test) | Attack Success Rate (ASR): 98 | 42 |
| Jailbreak Attack | JailbreakBench | Attack Success Rate (ASR): 96 | 40 |
| Jailbreak Attack | JailbreakBench | ASR197 | 39 |
| Jailbreak | JailbreakBench (original split) | ASR@195.15 | 33 |
| Jailbreak Defense | JailbreakBench | ASR (GCG): 0 | 30 |
| Jailbreak Attack | JailbreakBench | ASR: 91 | 27 |
| Thinking Collapse Analysis | JailbreakBench (JBB) | Thinking Collapse Rate: 0 | 25 |
| Jailbreaking | JailbreakBench | ASR (Detoxify): 0 | 20 |
| Jailbreak Defense | JailbreakBench | Rate of Response Safety: 70 | 20 |
| Adversarial and Jailbreaking Attack Detection | JailbreakBench | AUROC: 0.8622 | 20 |
| Jailbreak Attack | JailbreakBench | Llama2 7B Attack Success Rate: 77 | 18 |
| Red-teaming Attack Success Rate | JailbreakBench (test) | ASR (Vicuna): 82 | 18 |
| Jailbreak Safety | JailbreakBench | Reasoning Harmful Ratio: 0.3 | 17 |
| Safety Evaluation | JailbreakBench | Harmful Rate: 0 | 16 |
| Jailbreak Attack | JailbreakBench PAIR | Attack Success Rate (ASR): 82 | 15 |
| Jailbreak Attack | JailBreakBench | JSR: 0.9 | 14 |
| Jailbreak Attack | JailbreakBench | ASR: 81 | 12 |
| Adversarial Attack | JailBreakBench non-trivial subset held-out prompts (Llama-2-7B: 280) | ASR: 26 | 12 |
Showing 25 of 52 rows