Adversarial and Jailbreaking Attack Detection on HarmBench

0.8887 AUROC

LLAMAGUARD3-1B-LOGITS

[Chart: results over time; x-axis label: Feb 4, 2026]
Updated 4d ago

Evaluation Results

Date       AUROC     (unlabeled)
2026.02    0.8887    0.36
2026.02    0.8868    1
2026.02    0.8857    1
2026.02    0.8642    1
2026.02    0.8595    0.645
2026.02    0.8185    0.965
2026.02    0.8102    0.585
2026.02    0.8007    0.955
2026.02    0.798     0.905
2026.02    0.7578    0.67
2026.02    0.7247    0.95
2026.02    0.6342    0.92
2026.02    0.4904    0.99
2026.02    0.4846    1
2026.02    0.4799    0.965
2026.02    0.4341    0.99
2026.02    0.4163    0.99
2026.02    0.4023    0.985
2026.02    0.3648    1
2026.02    0.3369    1
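
The results here are ranked by AUROC. As a rough illustration of what that number measures, the sketch below computes AUROC as the probability that a randomly chosen attack prompt receives a higher detector score than a randomly chosen benign prompt, with ties counted as half. The function name, scores, and labels are hypothetical and are not taken from HarmBench or from any method listed on this page.

```python
def auroc(scores, labels):
    """AUROC via pairwise comparison: P(score of positive > score of negative).

    scores: detector outputs, higher means "more likely an attack".
    labels: 1 for attack prompts, 0 for benign prompts.
    Ties between a positive and a negative count as 0.5.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical toy data, for illustration only.
example_scores = [0.9, 0.8, 0.35, 0.6, 0.2, 0.1]
example_labels = [1, 1, 1, 0, 0, 0]
example_auroc = auroc(example_scores, example_labels)
```

Under this reading, a value of 0.5 corresponds to a detector that ranks attacks no better than chance, and values below 0.5 mean benign prompts tend to score higher than attacks. The pairwise loop is quadratic and is meant only to make the definition concrete; for real evaluations a library routine such as scikit-learn's `roc_auc_score` would be the usual choice.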