| Task Name | Dataset Name | Metric | SOTA Result | Trend |
|---|---|---|---|---|
| Safety Classification | SafeRLHF | F1 Score | 0.94 | 32 |
| Harmlessness preference labeling accuracy | SafeRLHF-RMB (test) | Bench Accuracy | 70.6 | 15 |
| Reward Modeling | SafeRLHF Reversed | Accuracy | 88.4 | 9 |
| Reward Modeling | SafeRLHF Standard | Accuracy | 89.8 | 9 |
| Safety Alignment | SafeRLHF | Win Rate | 83 | 8 |
| Safety Moderation | SafeRLHF | F1 Score | 69.9 | 7 |
| Constitutional AI Alignment | SafeRLHF (test) | Likert Score (5-Point) | 4.652 | 6 |
| Multi-objective RLHF alignment | safeRLHF (test) | Win Rate | 52 | 1 |