Harmful Content Detection

Benchmarks

Dataset Name	SOTA Method	Metric	Value	Count	Updated
UnsafeBench	Yuvion VL	AUPRC	89.5	61	1mo ago
HoliSafe-Bench	Qwen3-VL 8B	AUPRC	75.6	49	2mo ago
PHEME New Attacks: ExplainDrive (test)	LLM-SGA/ARHOCD	Accuracy	82.91	15	4mo ago
WG, XS, AEGIS 2, ToxicChat, OAI Mod (test)		F1 Score	83.6	12	29d ago
Fortress	SInternal	ASR	18.6	12	2mo ago
PHEME Known Attacks: DeepWordBug, TFAdjusted, TREPAT (test)	LLM-SGA/ARHOCD	Accuracy	85.59	10	4mo ago
ELF-HP	Perspective API	Accuracy	48.57	8	3mo ago
ELF22	Perspective API	Accuracy	43.96	8	3mo ago
CADD	Perspective API	Accuracy	90.19	8	3mo ago
COVID-HATE	Perspective API	Accuracy	96.4	8	3mo ago
MT-CONAN	LlamaGuard-1	Accuracy	97.24	8	3mo ago
CONAN	LlamaGuard-1	Accuracy	98.47	8	3mo ago
Qian-Reddit		Accuracy	97.09	8	3mo ago
Qian-Gab		Accuracy	99.06	8	3mo ago
FoodGuardBench (test)	FoodGuard-4B	FNR	2.75	7	3mo ago
BeaverTails Harmful (held-out target labels)	Quotient Transfer	AUROC	0.793	6	2mo ago
ToxicChat	Llama Guard 3	Safe Rate	92.6	4	18d ago
Trolling-oriented generations DeepSeek-Llama 70B	Perspective API	Accuracy	16.24	4	3mo ago
Trolling-oriented generations Llama-3.1 70B		Accuracy	26.04	4	3mo ago
Trolling-oriented generations GPT-4o	Perspective API	Accuracy	19.88	4	3mo ago
CADD DeepSeek generations	Perspective API	Accuracy	60.13	4	3mo ago
CADD Llama-3.1 generations	Perspective API	Accuracy	69.18	4	3mo ago
CADD GPT-4o generations	Perspective API	Accuracy	64.84	4	3mo ago
Ours trolling-oriented synthetic	Perspective API	Accuracy	19.88	4	3mo ago
Ours CADD-based synthetic	Perspective API	Accuracy	65.55	4	3mo ago

Showing 25 of 28 rows