| Dataset Name | SOTA Method | Metric | Trend | | |
|---|---|---|---|---|---|
| HEx-PHI | ID-LoRA | HEx-PHI Score 97.2 | 162 | 2mo ago | |
| HarmBench | SFT | ASR 0 | 148 | 1d ago | |
| HexPhi | | Harmfulness 2 | 140 | 1mo ago | |
| Harmbench | NPO | Harmbench Score 0.06 | 127 | 5d ago | |
| Advbench | AOA | Safety Score 100 | 117 | 2d ago | |
| BeaverTails (test) | SafeInstruct | Harmful Score 7.9 | 110 | 8d ago | |
| MM-SafetyBench | RAI | Average ASR 0 | 98 | 6d ago | |
| DoNotAnswer Framed | TFS-IP-CoT | HRR 0 | 96 | 3mo ago | |
| Sorry-Bench | IDGAF | Safety Score 99.09 | 90 | 3mo ago | |
| HEx-PHI | | Attack Success Rate (ASR) 5.17 | 87 | 1d ago | |
| DirectHarm 4 | GradSafe | Attack Success Rate 9 | 87 | 1d ago | |
| DirectHarm | | Harmfulness Score 5 | 84 | 1mo ago | |
| Harmfulness Evaluation Sequences | llama2-13b-chat | Harmfulness Score 0.79 | 84 | 3mo ago | |
| XSTest Unsafe | RealSafe-R1 | False Compliance Rate (FC) 0 | 78 | 27d ago | |
| XSTest Safe | ReasoningGuard | FC 4 | 78 | 27d ago | |
| StrongReject | STAR-1 | Attack Success Rate 0 | 77 | 22d ago | |
| ToxiGen | VCL | Safety 100 | 77 | 2mo ago | |
| WildJailbreak | SInternal | ASR 0.068 | 70 | 22d ago | |
| MultiJail | Qwen3-4B | Safe Response Rate 99 | 66 | 2mo ago | |
| LLaMA-2-7B-CHAT Safety (test) | TRAP | Safety Score 0.55 | 60 | 3mo ago | |
| MM-Safety | MoRAS | ASR 0.4 | 57 | 2mo ago | |
| Safety Suite AdvBench, PKU-SafeRLHF, HarmBench, JailbreakBench, SORRY-Bench, HarmfulQA, ALERT | DPO-Mix | AdvBench Score 8.59 | 56 | 21h ago | |
| JailbreakBench (JBB) (test) | MLP | ASR (Llama-Guard-3-8B) 1.12 | 56 | 27d ago | |
| Refusal Signal Score | MLP | ASR 7.5 | 56 | 27d ago | |
| SecureBreak | MLP | ASR 4.44 | 56 | 27d ago | |