JBB-Behaviors

Benchmarks

| Task Name | Dataset Name | SOTA Result | Trend |
| --- | --- | --- | --- |
| Jailbreak Defense | JBB-Behaviors | ASR: 0 | 121 |
| Jailbreak | JBB-Behaviors utilitarian dilemmas (test) | Jailbreak Success Rate: 87 | 72 |
| Jailbreak Attack | JBB-Behaviors | Rule-Judge Score: 100 | 56 |
| Jailbreaking | JBB Behaviors | ASR: 100 | 35 |
| Jailbreak Attack | JBB Behaviors | ASR: 100 | 35 |
| Jailbreaking | JBB-Behaviors (test) | ASR (GPT-4o): 99 | 27 |
| Refusal mechanism ablation | JBB-Behaviors 100 harmful prompts (test) | Refusal Rate: 100 | 24 |
| Jailbreak Robustness | JBB-Behaviors (test) | ASR: 0 | 24 |
| LLM Jailbreaking | JBB-Behaviors Scenario J3 | Hypervolume: 0.707 | 21 |
| LLM Jailbreaking | JBB-Behaviors Scenario J2 | Hypervolume: 0.691 | 21 |
| LLM Jailbreaking | JBB-Behaviors Scenario J1 | Hypervolume: 59.1 | 21 |
| Robustness against priming vulnerability | JBB-Behaviors (test) | ASR (Guardrail Model): 0 | 20 |
| Jailbreak Attack Robustness | JBB-Behaviors | ASR (PAIR): 10 | 18 |
| Jailbreak Robustness | JBB-Behaviors | ASR (PAIR, Guardrail Model): 0.3 | 18 |
| Jailbreak | JBB-Behaviors | ASR (GPT-4o): 99.2 | 12 |
| Jailbreak Attack | JBB-Behaviors GPT-4o | Attack Success Rate (ASR): 93 | 11 |
| Jailbreak Attack | JBB-Behaviors GPT-3.5-Turbo | ASR: 98 | 11 |
| Jailbreak Attack | JBB-Behaviors DeepSeek-R1-14B | ASR: 98 | 11 |
| Jailbreak Attack | JBB-Behaviors Qwen2.5-14B | ASR: 98 | 11 |
| Jailbreak Attack | JBB-Behaviors Llama-3.3-70B | Attack Success Rate (ASR): 99 | 11 |
| Safety Evaluation | JBB-Behaviors | Safety Score: 99.3 | 9 |
| Safety Detection | JBB-Behaviors (held-out) | AUROC: 93.7 | 5 |
| Safety Evaluation | JBB-Behaviors | Unsafe Interaction Rate: 0 | 3 |
| Jailbreak Detection | JBB-Behaviors Benign | Accuracy: 99.96 | 1 |
| Jailbreak Detection | JBB-Behaviors Harmful | Accuracy: 99.94 | 1 |