| Task Name | Dataset Name | Metric | SOTA Result | Trend |
|---|---|---|---|---|
| Safety Alignment | PKU-SafeRLHF 30K (IID) | Win Rate (WR) | 89.26 | 36 |
| Safety Alignment Evaluation | PKU-SafeRLHF 30K (test) | Win Rate (WR) | 90.23 | 32 |
| Human Preference Alignment | PKU-SafeRLHF | BLEU | 0.324 | 31 |
| Safety Evaluation | PKU-SafeRLHF-V | Accuracy | 77.8 | 20 |
| Reward Modeling | PKU-SafeRLHF (test) | MAE | 0.0871 | 19 |
| Safety Alignment | PKU-SafeRLHF | Gold Reward | 3.92 | 14 |
| LLM Alignment | PKU-SafeRLHF | BWR (Median) | 49 | 12 |
| Best-of-N Alignment | PKU-SafeRLHF | Percent batches with BWR > 0.50 | 38 | 12 |
| Safety Alignment | PKU-SafeRLHF in-distribution (test) | Accuracy (EN) | 99.44 | 10 |
| Harmfulness Evaluation | PKU-SafeRLHF | Beaver-7B-Cost Score | -1.11 | 10 |
| Privacy Violation Detection | PKU-SafeRLHF | Accuracy | 87.5 | 9 |
| Preference Evaluation | PKU-SafeRLHF | Win Rate | 57 | 8 |
| Safe RLHF Alignment | PKU-SafeRLHF 30K | Helpfulness | 6.51 | 7 |
| Helpfulness | PKU-SafeRLHF 30K | Win Rate | 84.5 | 6 |
| Harmlessness | PKU-SafeRLHF 30K | Win Rate | 87.25 | 6 |
| LLM Alignment | PKU-SafeRLHF 2024 (test) | Win Rate | 0.58 | 4 |
| Open-ended Dialogue | PKU-SafeRLHF OOD | Win Rate | 67.8 | 4 |
| Preference Alignment | PKU-SafeRLHF (test) | Win Rate | 28.69 | 3 |
| Mental Manipulation Detection | PKU-SafeRLHF | Accuracy | 80 | 3 |
| Safety Alignment | PKU-SafeRLHF (test) | RM Safety Accuracy | 69.92 | 3 |
| Malicious Goal Attack (Longer Token Generation) | PKU-SafeRLHF (test) | RM Length Accuracy | 50.17 | 3 |
| Alignment Task Evaluation | PKU-SafeRLHF w/o trigger | RM Safety Accuracy | 70.09 | 3 |
| Alignment Task Evaluation | PKU-SafeRLHF w/ trigger | RM Safety Accuracy | 70.97 | 3 |
| Malicious Goal Evaluation | PKU-SafeRLHF w/o trigger | RM Length Accuracy | 44.32 | 3 |
| Malicious Goal Evaluation | PKU-SafeRLHF w/ trigger | RM Length Accuracy | 64.82 | 3 |
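Two of the most common metrics in the table, Win Rate (fraction of pairwise comparisons a model wins) and MAE (used here for reward-model error), can be sketched as follows. This is a minimal illustration, not the evaluation code behind any of these results; the function names and sample values are invented for the example.

```python
def win_rate(outcomes):
    """Win Rate as a percentage over pairwise comparisons.

    `outcomes` holds 1 for a win, 0.5 for a tie, 0 for a loss
    (a common convention; the papers in the table may differ).
    """
    return 100.0 * sum(outcomes) / len(outcomes)


def mean_absolute_error(predicted, gold):
    """Mean absolute error between predicted and gold reward scores."""
    assert len(predicted) == len(gold)
    return sum(abs(p - g) for p, g in zip(predicted, gold)) / len(predicted)


if __name__ == "__main__":
    print(win_rate([1, 1, 0, 1, 0.5]))                 # 70.0
    print(mean_absolute_error([1.0, 2.0], [1.5, 3.0]))  # 0.75
```

Note that some rows report Win Rate on a 0-100 scale while others (e.g. "Win Rate 0.58") appear to use a 0-1 scale; scores are only comparable within a single row's source.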