
Jailbreak

Benchmarks

| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Jailbreak Attack Resistance Evaluation | Jailbreak | MMLU: 62.3 | 40 |
| Jailbreak Detection | Jailbreak data (70/30 stratified) | AUC: 100 | 32 |
| Classification | JAILBREAK (test) | Accuracy: 98.8 | 32 |
| Jailbreak | Jailbreak | MMLU: 65.5 | 20 |
| Prompt Injection Defense | Jailbreak MI-FGSM | Attack Success Rate: 4 | 12 |
| Prompt Injection Defense | Jailbreak APGD | ASR: 1 | 12 |
| Jailbreak Evaluation | JailBreak R1 | Attack Success Rate (ASR): 1.3 | 12 |
| Safety Evaluation | JailBreak-R1 | LRM-JailBreak Score: 0 | 12 |
| Steering | Jailbreak | Steering Success: 82.5 | 11 |
| Jailbreak Robustness | Jailbreak Cipher, CodeChameleon (test) | Cipher Success Rate: 95.75 | 10 |
| LLM Red-teaming | Jailbreak R1-defended Target Model | UA: 87.67 | 9 |
| Adversarial Robustness | Jailbreak Structured-based | Perplexity: 2.62 | 9 |
| Adversarial Robustness | Jailbreak Perturbation-based | Perplexity: 2.58 | 9 |
| Jailbreaking | JailBreak | LG4 ASR: 1 | 8 |
| OOD Detection | Jailbreak (test) | Length-Matched AUROC: 85.8 | 5 |
| Jailbreak resistance evaluation | Jailbreak augmented examples | Not Unsafe Rate: 100 | 4 |
| Calibration Analysis | Jailbreak | AUROC: 90 | 2 |
| Jailbreak Detection | Jailbreak V28K | Accuracy: 99.98 | 2 |