SOTA Safety Alignment benchmarks and papers with code

Benchmarks

Dataset Name	SOTA Method	Metric	Value		Updated
HarmBench		ASR	0	88	4mo ago
Salad Bench	ShaPO-T	MD	0.68	68	4mo ago
HH-RLHF	ShaPO-T	MD Rate	1.09	68	4mo ago
Do-Not-Answer	ShaPO-R	MD	0	52	4mo ago
WildJailbreak		Trainable parameters (M)	15,768.31	44	4mo ago
Visual Adversarial Attacks		ASR	43.1	40	3mo ago
JOOD	MoRAS	ASR	0	40	3mo ago
SORRY-Bench	LED-Merging	ASR	10.22	40	2mo ago
PKU-SafeRLHF 30K (IID)	ShaPO-T	WR	89.26	36	4mo ago
AdvBench	SEA	Reward	-0.38	32	4mo ago
Harmful Dataset (test)		Harmful Score	81	30	4mo ago
BeaverTails V	SaFeR-ToolKit (+ SFT+GRPO) [3B]	Safety Score	93.37	27	1mo ago
AdvBench	BoN64	Harm Rate	0	25	1mo ago
WildJailbreak	R1 - 8B + UnsafeChain full	Safe@1	77.2	24	3mo ago
Safety Benchmarks (Sorry-bench, StrongREJECT, WildJailbreak, JBB-PAIR, JBB-GCG)	SafeChain	Average Score	42.34	21	2mo ago
XSTest	Yi-VL-6B	Compliance	95.2	21	2mo ago
HEx-PHI	DiaBlo	HEx-PHI Score	98.8	18	2mo ago
HarmBench	SFT	MD Score	95	18	4mo ago
Average (Do-Not-Answer, HarmBench, HH-RLHF, Salad Bench)	ShaPO-T	Aggregate Score	0.59	18	4mo ago
StrongReject	R1 - 7B + UnsafeChain full	Safe@1	58	18	2mo ago
VSA Violence (visual synonym attack prompts)	AEGIS	ASR	0	16	17d ago
RAB Violence (adversarial prompts)	AEGIS	ASR	1	16	17d ago
I2P Violence (explicit prompts)	SDID	ASR	1	16	17d ago
SORRY-Bench		Score	85.7	14	18d ago
PKU-SafeRLHF	PPO	Gold Reward	3.92	14	4mo ago

Showing 25 of 60 rows