
Adversarial and Jailbreaking Attack Detection on HarmBench

0.8887 AUROC

LLAMAGUARD3-1B-LOGITS

[Chart: axis ticks 0.314828, 0.463814, 0.6128, 0.761786; Feb 4, 2026]
Updated 1mo ago

Evaluation Results

Date     AUROC   (unlabeled)  Method  Links
2026.02  0.8887  0.36
2026.02  0.8868  1
2026.02  0.8857  1
2026.02  0.8642  1
2026.02  0.8595  0.645
2026.02  0.8185  0.965
2026.02  0.8102  0.585
2026.02  0.8007  0.955
2026.02  0.798   0.905
2026.02  0.7578  0.67
2026.02  0.7247  0.95
2026.02  0.6342  0.92
2026.02  0.4904  0.99
2026.02  0.4846  1
2026.02  0.4799  0.965
2026.02  0.4341  0.99
2026.02  0.4163  0.99
2026.02  0.4023  0.985
2026.02  0.3648  1
2026.02  0.3369  1