Harmful Content Detection on Standard Harmful Content Datasets (Goal Hijacking Attack)

96Phishing

GAVEL

Updated 3mo ago

Evaluation Results

Method	Links
GAVEL 2026.01		96	89	87	100	99	99	86	100	90
GPT4 2026.01		55	70	49	90	63	91	28	48	12