| Dataset Name | SOTA Method | Metric | Value | Trend | Updated |
|---|---|---|---|---|---|
| Stereotype | DR-IRL | Refusal Rate | 99.03 | 20 | 1mo ago |
| XsTest | Self-Rewarding | Refusal Rate | 99 | 20 | 1mo ago |
| Toxigen | | Toxigen (%) | 100 | 17 | 1mo ago |
| PKU-SafeRLHF-30K | SafeDPO | Win Rate | 87.25 | 6 | 1mo ago |
| GPT-4 Evaluation Template T2 (overall) | SafeDPO | Win Rate | 89.99 | 5 | 1mo ago |
| Template T3 GPT-4 evaluation (test) | SafeDPO | Win Rate | 87.5 | 5 | 1mo ago |
| HH-RLHF (test) | MAVIS | Reward | 2.772 | 4 | 1mo ago |