
HarM

Benchmarks

| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Hateful meme classification | HarM (test) | AUC 91.03 | 31 |
| Jailbreak Attempt Forgetting | Harm Jailbreak 2 | ASR 71.4 | 28 |
| Jailbreak Attempt Forgetting | JB-1 Jailbreak Harm-1 | ASR (%) 73.1 | 28 |
| Harmful Question Forgetting | Harm-2 GPTFUZZER WildAttack | Attack Success Rate (ASR) 0 | 28 |
| Harmful Question Forgetting | Harm-1 GPTFUZZER WildAttack | ASR 61 | 28 |
| Scrubbing Attack | Harm | AUC 80 | 20 |
| Spoofing Attack Detection | Harm | WCS 8.933 | 18 |
| Harmful Meme Detection | HarM | Accuracy 83.82 | 13 |
| Hateful Meme Detection | HarM | AUC 90.25 | 12 |
| Harmful meme detection | Harm-C (test) | Accuracy 87 | 10 |