Safety Classification on ToxicChat (out-of-distribution)

Best F1 Score: 72.88 (Multi-head self-attn)

Updated 5mo ago

Evaluation Results

Method	Links	F1 Score	(unlabeled metric)
Multi-head self-attn 2026.01		72.88	0.798
WildGuard 2026.01		70.8	-
Aegis-Guard-D 2026.01		70	-
GPT-4 2026.01		68.3	-
ShieldHead (Gemma2-27B) 2026.01		67.7	-
Scoring attention 2026.01		64.81	0.706
ShieldHead (Llama3.1-8B) 2026.01		64.3	-
Llama Guard 2026.01		61.6	0.626
OpenAI Moderation 2026.01		61.4	0.631
Direct pooling 2026.01		53.33	0.565
Llama-Guard2 2026.01		47.1	-