| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Helpfulness alignment | Anthropic hh-rlhf | Gold Reward: 3.36 | 14 |
| Preference Alignment | Anthropic-hh-rlhf (test) | LLM-as-a-Judge Helpful Score: 5.83 | 12 |
| LLM Alignment | Anthropic HH-RLHF 2022 (test) | Win Rate: 62 | 4 |
| Preference Learning | Anthropic HH-RLHF+VI Preference (test) | Overall Accuracy: 64 | 3 |