| Dataset Name | SOTA Method | Metric | Value | Trend | Updated |
|---|---|---|---|---|---|
| HEx-PHI | ID-LoRA | HEx-PHI Score | 97.2 | 162 | 1mo ago |
| Advbench | AOA | Safety Score | 100 | 117 | 4d ago |
| Harmbench | NPO | Harmbench Score | 0.06 | 112 | 8d ago |
| DoNotAnswer Framed | TFS-IP-CoT | HRR | 0 | 96 | 1mo ago |
| Sorry-Bench | IDGAF | Safety Score | 99.09 | 90 | 1mo ago |
| Harmfulness Evaluation Sequences | llama2-13b-chat | Harmfulness Score | 0.79 | 84 | 1mo ago |
| ToxiGen | VCL | Safety | 100 | 77 | 1mo ago |
| MultiJail | Qwen3-4B | Safe Response Rate | 99 | 66 | 1mo ago |
| StrongReject | DirectRefusal | Attack Success Rate | 0.64 | 65 | 4d ago |
| LLaMA-2-7B-CHAT Safety (test) | TRAP | Safety Score | 0.55 | 60 | 1mo ago |
| MM-Safety | MoRAS | ASR | 0.4 | 57 | 18d ago |
| WildJailbreak | SET | ASR | 0.101 | 53 | 4d ago |
| WildJailbreak (held-out) | NeST | Attack Success Rate (ASR) | 0 | 50 | 1mo ago |
| HarmBench | SET | ASR | 4.1 | 42 | 4d ago |
| MM-SafetyBench | RAI | Average ASR | 0 | 42 | 1mo ago |
| SPA-VL | AutoSteer | ASR | 0.2 | 40 | 19d ago |
| Harmful Prompts | Surgery | Harmful Score | 8.3 | 40 | 1mo ago |
| CocoNot | GRAPH ROUTER | Safety Score | 0.613 | 36 | 1mo ago |
| FigStep | MoRAS | ASR | 0.6 | 32 | 19d ago |
| HarmBench | SFT | MD | 95 | 32 | 1mo ago |
| AdvBench 50 examples | | Safe Response Rate | 100 | 32 | 1mo ago |
| XSTest (test) | DPO + OGPSA | XSTest Score | 95 | 32 | 1mo ago |
| AdvBench | Post-hoc (LlamaGuard) | Overall Safety Score | 100 | 30 | 1mo ago |
| PS-Bench base setting (test) | | ASR (Hate Speech) | 18 | 30 | 1mo ago |
| Harmful and Jailbreak datasets | | Harm-1 Score | 16 | 28 | 10d ago |