| Dataset Name | SOTA Method | Metric | Value | Trend | Updated |
|---|---|---|---|---|---|
| Llama-Guard | IPO | Harmfulness (%) | 82.14 | 36 | 3mo ago |
| PKU-SafeRLHF 30K (test) | wDPO | Win Rate (WR) | 90.23 | 32 | 2mo ago |
| HEx-PHI | | Harmful Response Rate | 0.7 | 18 | 7d ago |
| SorryBench | Staged-Competence | Harmful Response Rate (%) | 4.2 | 18 | 7d ago |
| Refusal Evaluation Dataset | | Refusal Rate | 99 | 16 | 3mo ago |
| WildJailbreak (WildJB) | Stair-DPO | Safety Rate | 98.6 | 14 | 1d ago |
| Strata | MESA | Safety Rate | 99 | 14 | 1d ago |
| StrongReject SR-PAPL | MESA | Safety Rate | 100 | 14 | 1d ago |
| StrongReject SR-PAPA | MESA | Safety Rate | 100 | 14 | 1d ago |
| StrongReject SR-PAP_M | MESA | Safety Rate | 100 | 14 | 1d ago |
| StrongReject SR-Pair | Stair-DPO | Safety Rate | 98.72 | 14 | 1d ago |
| StrongReject SR-base | | Safety Rate | 100 | 14 | 1d ago |
| OOD Safety Suite (average of SorryBench, AdvBench, and HEx-PHI) | Sqrt-Competence | Average Absolute Improvement | 0.5 | 12 | 7d ago |
| RTA | | Utility | 53 | 9 | 19d ago |
| LATharm | OpenAI Moderation | Utility | 54 | 9 | 19d ago |
| Safety Evaluation Dataset | DCR | Response Safe Rate (Llama Guard Model) | 81 | 5 | 3mo ago |