Red Teaming

Benchmarks

Dataset Name	SOTA Method	Metric	Value	Count	Updated
HarmBench	AutoDAN-Turbo	ASR	96.3	244	3mo ago
Violence prompts	UnlearnDiffAtk	Failure Rate (FR)	0	48	2mo ago
I2P Nudity prompts	Ring-A-Bell	Failure Rate (FR)	0	48	2mo ago
50 harmful goals (Manual evaluation)	PAIR	Hard ASR	100	30	4mo ago
CatQA	SafeTransformer	ASR	0	20	4mo ago
AdversarialQA	SafeTransformer	ASR	0	20	4mo ago
AdvBench current (full)	OTTER-MLM	ASR (%)	87.3	12	1mo ago
Religious Discrimination principle v1 (test)	QCI	Mean Best Category Score	5.32	12	4mo ago
Illegal Activity principle v1 (test)		Mean Score (Best Category)	-2.73	12	4mo ago
AI Supremacy principle v1 (test)	CRL	Mean Best Category Score	11.7	12	4mo ago
AdvBench (test)	AMIS	ASR	88	8	3mo ago
DailyDialog against DialoGPT-large	BRT (e+r)	RSR	40	8	4mo ago
DailyDialog against BB-3B	BRT (e+r)	RSR	40.2	8	4mo ago
ConvAI2 (filtered hard positive)	BRT (e+r)	RSR	2,120	7	4mo ago
Bloom ZS (filtered hard positive)	BRT (e+r)	RSR	15.6	7	4mo ago
BAD Against Friend Chat (test)	BRT (e)	RSR	64.2	7	4mo ago
BAD Against Marv (test)	BRT (s+r)	RSR	88.1	7	4mo ago
GPT-OSS 20B	PAIR	Coverage	63.2	5	2mo ago
Llama-3-8B	Ours (ME)	Coverage	63.04	5	2mo ago
Web-Augmented LLM Red-Teaming Evaluation Set	CREST-Search	Detection Rate	80.5	5	3mo ago
Korean red teaming dataset (test)	Exaone-3.5-2.4B-inst	Attack Success Rate	0.5797	5	4mo ago
HarmBench Claude-Sonnet-3.5 (held-out test)	AGENTICRED	ASR	60	5	4mo ago
HarmBench Llama-3-8B (test)	AGENTICRED	ASR	0.98	5	4mo ago
HarmBench Llama-2-7B (test)	AutoDAN-Turbo	ASR	36	5	4mo ago
GPT-5 Mini	Ours (ME)	Coverage	72.32	4	2mo ago

Showing 25 of 38 rows