Our new X account is live! Follow @wizwand_team for updates
SOTA Safety Evaluation benchmarks and papers with code | Wizwand
Safety Evaluation
Benchmarks
| Dataset Name | SOTA Method | Metric | Value | Results | Last Updated |
|---|---|---|---|---|---|
| HEx-PHI | ID-LoRA | HEx-PHI Score | 97.2 | 148 | 3d ago |
| Advbench | AOA | Safety Score | 100 | 117 | 3d ago |
| DoNotAnswer Framed | TFS-IP-CoT | HRR | 0 | 96 | 3d ago |
| Sorry-Bench | IDGAF | Safety Score | 99.09 | 90 | 3d ago |
| Harmfulness Evaluation Sequences | llama2-13b-chat | Harmfulness Score | 0.79 | 84 | 3d ago |
| Harmbench | NPO | Harmbench Score | 0.06 | 76 | 2d ago |
| ToxiGen | Self-Improving Pretraining | Safety | 93.1 | 71 | 3d ago |
| LLaMA-2-7B-CHAT Safety (test) | TRAP | Safety Score | 0.55 | 60 | 3d ago |
| WildJailbreak (held-out) | NeST | Attack Success Rate (ASR) | 0 | 50 | 3d ago |
| StrongReject | DirectRefusal | Attack Success Rate | 0.64 | 45 | 3d ago |
| MM-SafetyBench | RAI | Average ASR | 0 | 42 | 3d ago |
| Harmful Prompts | Surgery | Harmful Score | 8.3 | 40 | 3d ago |
| CocoNot | GRAPH ROUTER | Safety Score | 0.613 | 36 | 3d ago |
| AdvBench 50 examples | Direct Instruction | Safe Response Rate | 100 | 32 | 3d ago |
| XSTest (test) | DPO + OGPSA | XSTest Score | 95 | 32 | 3d ago |
| PS-Bench base setting (test) | Stateless | ASR (Hate Speech) | 18 | 30 | 3d ago |
| Safety Evaluation Suite HarmBench, StrongReject, WildJailbreak, XSTest | Initial | HarmBench Score | 68.44 | 28 | 3d ago |
| Sorry-Bench base | Base | Safety Score | 92.73 | 27 | 3d ago |
| Safety | Olmo 3.1 32B Instruct | Score | 92.1 | 27 | 3d ago |
| Wildguard (test) | WHP (W/O safe set) | Wildguard Test Score | 0.08 | 27 | 3d ago |
| SafetyBench en | ROSE | Avg Score | 81.2 | 25 | 3d ago |
| UnsafeBench | BLPO | F1 Score | 89 | 24 | 3d ago |
| Harmful Benchmarks (CATQA, HEX-PHI, Salad-Base) | Safe LORA | CATQA Score | 99.94 | 24 | 3d ago |
| HH-RedTeaming | Llama-2 Chat | H.R.adv | 0.036 | 22 | 3d ago |
| HH-RLHF (test) | SAILS | Harm Score | 1.02 | 21 | 3d ago |
Showing 25 of 162 rows