StrongREJECT

Benchmarks

| Task Name | Dataset Name | Metric | SOTA Result | Trend |
|---|---|---|---|---|
| Jailbreak Attack | StrongREJECT | Attack Success Rate | 99.4 | 262 |
| Safety Evaluation | StrongReject | Attack Success Rate | 0 | 77 |
| Jailbreak Robustness | StrongREJECT | Mean Harmful Score | 0 | 71 |
| Safety Alignment Breaking Prevention | StrongREJECT | Harmful Score (%) | 0 | 60 |
| Jailbreak Defense | StrongReject | Attack Success Rate | 1.5 | 54 |
| Red-teaming Safety Evaluation | StrongReject | ASR | 4 | 53 |
| Jailbreak Robustness | StrongReject | Direct Attack Rate | 67 | 30 |
| Multi-turn Jailbreaking | StrongReject (test) | ASR | 0.34 | 30 |
| Safety and Helpfulness Evaluation | StrongREJECT | Harm Rate | 0.2 | 29 |
| Jailbreaking | StrongReject (test) | ASR (GPT-4o) | 96 | 27 |
| Adversarial Attack | StrongREJECT Original (test) | CHR | 46 | 27 |
| Adversarial Attack | StrongREJECT Hijacked (test) | CHR | 0 | 27 |
| Safety Evaluation | StrongReject | H Score | 62 | 22 |
| Safety Evaluation | StrongReject | Safety Score | 97 | 21 |
| Backdoor detection | StrongREJECT prompts with triggers | TPR | 100 | 20 |
| Jailbreaking | StrongREJECT | ASR (Detoxify) | 0 | 20 |
| Harmful Content Safety | StrongReject (SR) | Evaluation Score (avg@4) | 100 | 18 |
| Backdoor Attack Evaluation | StrongREJECT | ASR (w/ trigger) | 0.601 | 18 |
| Safety Alignment | StrongReject | Safe@1 | 58 | 18 |
| Safety Evaluation | StrongReject (SR) | Reasoning Harmful Ratio | 16.7 | 17 |
| Safety Moderation | StrongReject | F1 Score | 100 | 15 |
| Safety Alignment Evaluation | StrongReject SR-PAPL | Safety Rate | 100 | 14 |
| Safety Alignment Evaluation | StrongReject SR-PAPA | Safety Rate | 100 | 14 |
| Safety Alignment Evaluation | StrongReject SR-PAP_M | Safety Rate | 100 | 14 |
| Safety Alignment Evaluation | StrongReject SR-Pair | Safety Rate | 98.72 | 14 |
Showing 25 of 48 rows