| Task Name | Dataset Name | SOTA Metric | SOTA Value | Trend |
|---|---|---|---|---|
| Safety Alignment | PKU-SafeRLHF 30K (IID) | WR | 89.26 | 36 |
| Safety Evaluation | PKU-SafeRLHF-V | Accuracy | 77.8 | 20 |
| Safety alignment | PKU-SafeRLHF | Gold Reward | 3.92 | 14 |
| Safety Alignment | PKU-SafeRLHF in-distribution (test) | Accuracy (EN) | 99.44 | 10 |
| Harmfulness Evaluation | PKU-SafeRLHF | Beaver-7B-Cost Score | -1.11 | 10 |
| Privacy Violation Detection | PKU-SafeRLHF | Acc | 87.5 | 9 |
| Preference Evaluation | PKU-SafeRLHF | Win Rate | 57 | 8 |
| LLM Alignment | PKU-SafeRLHF 2024 (test) | Win Rate | 0.58 | 4 |
| Open-ended Dialogue | PKU-SafeRLHF OOD | Win Rate | 67.8 | 4 |
| Preference Alignment | PKU-SafeRLHF (test) | Win Rate | 28.69 | 3 |
| Mental Manipulation Detection | PKU-SafeRLHF | Accuracy | 80 | 3 |
| Safety Alignment | PKU-SafeRLHF (test) | RM Safety Accuracy | 69.92 | 3 |
| Malicious Goal Attack (Longer Token Generation) | PKU-SafeRLHF (test) | RM Length Accuracy | 50.17 | 3 |
| Alignment Task Evaluation | PKU-SafeRLHF w/o trigger | RM Safety Acc | 70.09 | 3 |
| Alignment Task Evaluation | PKU-SafeRLHF w/ trigger | RM Safety Acc | 70.97 | 3 |
| Malicious Goal Evaluation | PKU-SafeRLHF w/o trigger | RM Length Accuracy | 44.32 | 3 |
| Malicious Goal Evaluation | PKU-SafeRLHF w/ trigger | RM Length Acc | 64.82 | 3 |
| Insulting Behavior Detection | PKU-SafeRLHF | Accuracy | 78 | 1 |
| Discriminatory Behaviour Detection | PKU-SafeRLHF | Accuracy | 96 | 1 |