| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Safety Alignment | HH-RLHF | MD Rate 1.09 | 68 |
| Helpful and Harmless Preference Reasoning | HH-RLHF | Accuracy 54.3 | 56 |
| Preference Alignment | HH-RLHF (test) | Win Rate 87.4 | 36 |
| Preference Alignment | HH-RLHF | ASR 99.4 | 32 |
| Assistant Response Alignment (Helpfulness and Harmlessness) | HH-RLHF (test) | Helpfulness Win Rate 89.42 | 31 |
| Preference Modeling | HH-RLHF | Accuracy 61.4 | 30 |
| LLM Alignment | HH-RLHF (test) | Diversity 0.87 | 23 |
| Question Answering | HH-RLHF | Accuracy 59 | 22 |
| Safety Evaluation | HH-RLHF (test) | Harm Score 1.02 | 21 |
| Helpful Dialogue | Anthropic HH-RLHF helpful core250 (test) | Reward Score 18.93 | 18 |
| LLM Judgement Confidence Estimation | HH-RLHF (test) | RK 0.4763 | 16 |
| LLM Alignment | HH-RLHF 300 prompts | Win/Tie Rate vs Vanilla (GPT-4o) 69.8 | 16 |
| RLHF | HH-RLHF | Human Win Rate 74 | 16 |
| RLHF Alignment | HH-RLHF (held-out) | Win Rate 78 | 14 |
| LLM-as-a-judge | HH-RLHF | Coverage 81.3 | 12 |
| Reward Modeling | HH-RLHF helpful core250 (held-out evaluation) | Reward Score 20.155 | 12 |
| Best-of-N Alignment | HH-RLHF (test) | Percent batches with BWR > 0.5098 | 12 |
| Alignment | HH-RLHF | Estimated Score (EST) 154 | 12 |
| Best-of-N Alignment | HH-RLHF | BWR 53 | 12 |
| Reward model verification | HH-RLHF | Win Rate 47.3 | 12 |
| Harmlessness evaluation | HH-RLHF harmless (test) | Win Rate 83.33 | 12 |
| Confidence Estimation | HH-RLHF | Rank Correlation (RK) 0.4718 | 11 |
| Helpful Assistant | HH-RLHF | HV Score 9.08 | 10 |
| RLHF | HH-RLHF (held-out) | Peak Gold Reward 1.59 | 9 |
| Certified Poisoning Stability | HH-RLHF | FTS@1100 | 9 |