
HarmBench

Benchmarks

| Task Name | Dataset Name | Metric | SOTA Result | Trend |
| --- | --- | --- | --- | --- |
| Jailbreak Attack | HarmBench | Attack Success Rate (ASR) | 100 | 557 |
| Red-Teaming | HarmBench | ASR | 96.3 | 244 |
| Jailbreak Attack | HarmBench (test) | ASRHB | 99.73 | 212 |
| Safety Evaluation | HarmBench | ASR | 0 | 148 |
| Safety Evaluation | Harmbench | Harmbench Score | 0.06 | 127 |
| Response Harmfulness Detection | HarmBench | F1 Score | 98.94 | 100 |
| Jailbreak Defense | HarmBench | PAIR ASR | 0 | 91 |
| Safety Alignment | HarmBench | ASR | 0 | 88 |
| Unsafe Robustness | HarmBench | Unsafe Rate | 1 | 72 |
| Jailbreak Robustness | HarmBench | HarmBench ASR | 0 | 72 |
| Jailbreaking | HarmBench | Attack Success Rate (ASR) | 82.3 | 68 |
| Multimodal Jailbreak Attack | HarmBench | ASR | 0 | 62 |
| Safety Alignment Breaking Prevention | HarmBench | Harmful Score (%) | 0 | 60 |
| Jailbreak Attack Success Rate | HarmBench | Attack Success Rate (Generated) | 96 | 52 |
| Harmful Prompt Refusal | HarmBench | ASR | 0 | 52 |
| Jailbreaking | HARMBENCH 159 standard behaviors (test) | ASR | 0 | 51 |
| Jailbreak | HarmBench | Toxicity Score | 1.01 | 50 |
| Jailbreak | HarmBench Standard Behaviours (200 examples) | ASR | 0 | 48 |
| Jailbreak Attack | HarmBench-191 (dev) | Attack Success Rate (ASR) | 97.4 | 42 |
| Refusal Ablation and Jailbreak Attack Success | HARMBENCH | Attack Success Rate (ASR) | 96.27 | 40 |
| Controllability | HarmBench | HarmBench Score | 87.5 | 40 |
| Safety Evaluation | Harmbench | ASR | 3.5 | 39 |
| Transferable Adversarial Attack | HarmBench Classifier (test) | TASR@1 | 88.6 | 37 |
| Jailbreak | HarmBench (HB) (standard split) | ASR@1 | 90.57 | 33 |
| Safety Evaluation | HarmBench | MD | 95 | 32 |
Showing 25 of 148 rows