
HarmBench

Benchmarks

| Task Name | Dataset Name | SOTA Result | Trend |
| --- | --- | --- | --- |
| Jailbreak Attack | HarmBench | Attack Success Rate (ASR): 100 | 376 |
| Safety Alignment | HarmBench | ASR: 0 | 88 |
| Safety Evaluation | Harmbench | Harmbench Score: 0.06 | 76 |
| Jailbreaking | HARMBENCH 159 standard behaviors (test) | ASR: 0 | 51 |
| Jailbreak | HarmBench Standard Behaviours (200 examples) | ASR: 0 | 48 |
| Controllability | HarmBench | HarmBench Score: 87.5 | 40 |
| Transferable Adversarial Attack | HarmBench Classifier (test) | TASR@1: 88.6 | 37 |
| Red-teaming Safety Evaluation | Harmbench | ASR: 0.3 | 32 |
| Adversarial Risk Estimation | HarmBench (test) | ASR@1000: 100 | 24 |
| Response Harmfulness Detection | HarmBench | F1 Score: 87.61 | 23 |
| Harmfulness Evaluation | HarmBench | Harmful Response Ratio: 21.26 | 21 |
| Adversarial and Jailbreaking Attack Detection | HarmBench | AUROC: 0.8887 | 20 |
| Jailbreaking | HarmBench 51 (test) | ASR@5 (Standard): 98.1 | 19 |
| Safety Alignment | HarmBench | MD Score: 95 | 18 |
| Jailbreak Attack Evaluation | HarmBench (400 random samples) | ASR: 0 | 18 |
| Coherence evaluation | HarmBench (test) | N-gram Repetition Rate: 0.22 | 16 |
| Malicious Prompt Refusal | HarmBench | Refusal Rate: 96 | 15 |
| Prompt Classification | HarmBench Text Prompt | F1 Score: 98.85 | 14 |
| Jailbreak Attack | Harmbench Malicious (full) | Harmful Score: 1.13 | 14 |
| Safety Classification | HarmBench | Recall: 100 | 14 |
| Adversarial Robustness | HarmBench | DR: 56.25 | 12 |
| Prompt-Response Safety Routing | HarmBench | Routing F1: 55.92 | 10 |
| Safety Classification | HarmBench (test) | F1 Score: 90.5 | 9 |
| Safety Evaluation | HarmBench Hades | Safety Score (1-ASR): 1 | 8 |
| Safety Evaluation | HarmBench-QR | Safety Score (1-ASR): 0.9975 | 8 |
Showing 25 of 48 rows