
MaliciousInstruct

Benchmarks

| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Jailbreak Attack | MaliciousInstruct | ASR: 100 | 161 |
| Visual Jailbreaking Attack | MaliciousInstruct | ASR: 92 | 61 |
| Jailbreaking | MaliciousInstruct (test) | BERT Score: 4.63 | 32 |
| Adversarial and Jailbreaking Attack Detection | MaliciousInstruct | AUROC: 0.8825 | 20 |
| Jailbreaking | MaliciousInstruct | BERT Score: 4.8 | 11 |
| Jailbreak Attack | MaliciousInstruct (test) | ASR (Refusal): 95 | 10 |
| Jailbreak Attack | MaliciousInstruct | ASR (Google Banana 2): 62 | 8 |
| Jailbreak Attack | MaliciousInstruct 41 (test) | ASR: 0.935 | 6 |