HarmBench

Benchmarks

| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Jailbreak Attack | HarmBench | Attack Success Rate (ASR): 100 | 487 |
| Safety Evaluation | Harmbench | Harmbench Score: 0.06 | 112 |
| Safety Alignment | HarmBench | ASR: 0 | 88 |
| Jailbreaking | HARMBENCH 159 standard behaviors (test) | ASR: 0 | 51 |
| Jailbreak | HarmBench Standard Behaviours (200 examples) | ASR: 0 | 48 |
| Safety Evaluation | HarmBench | ASR: 4.1 | 42 |
| Refusal Ablation and Jailbreak Attack Success | HARMBENCH | Attack Success Rate (ASR): 96.27 | 40 |
| Controllability | HarmBench | HarmBench Score: 87.5 | 40 |
| Transferable Adversarial Attack | HarmBench Classifier (test) | TASR@1: 88.6 | 37 |
| Safety Evaluation | HarmBench | MD: 95 | 32 |
| Red-teaming Safety Evaluation | Harmbench | ASR: 0.3 | 32 |
| Jailbreaking | HarmBench (test) | ASR (GPT-4o): 97 | 27 |
| Red Teaming Attack | HarmBench (test) | ZS: 31.18 | 27 |
| Safety Moderation | Harmbench | F1 Score: 87.2 | 26 |
| Safety Evaluation | HarmBench Contextual Trajectory Evaluation Multi-turn | SFR: 96.1 | 24 |
| Adversarial Risk Estimation | HarmBench (test) | ASR@1000: 100 | 24 |
| Response Harmfulness Detection | HarmBench | F1 Score: 87.61 | 23 |
| Input Moderation | HarmBench (test) | F1 Score: 100 | 22 |
| Harmfulness Evaluation | HarmBench | Harmful Response Ratio: 21.26 | 21 |
| Adversarial and Jailbreaking Attack Detection | HarmBench | AUROC: 0.8887 | 20 |
| Jailbreaking | HarmBench 51 (test) | ASR@5 (Standard): 98.1 | 19 |
| Safety Alignment | HarmBench | MD Score: 95 | 18 |
| Jailbreak Attack Evaluation | HarmBench (400 random samples) | ASR: 0 | 18 |
| Jailbreak Attack | HarmBench example-based Llama3 8B | Attack Success Rate: 5 | 17 |
| Safety Evaluation | HarmBench translated (test) | Success Rate (EN): 11 | 16 |

Showing 25 of 83 rows