| Task Name | Dataset Name | Metric | SOTA Result | Trend |
|---|---|---|---|---|
| Safety Alignment | HH-RLHF | MD Rate | 1.09 | 68 |
| Helpful and Harmless Preference Reasoning | HH-RLHF | Accuracy | 54.3 | 56 |
| Preference Alignment | HH-RLHF (test) | Win Rate | 87.4 | 36 |
| Preference Alignment | HH-RLHF | BLEU | 0.275 | 31 |
| Assistant Response Alignment (Helpfulness and Harmlessness) | HH-RLHF (test) | Helpfulness Win Rate | 89.42 | 31 |
| Safety Evaluation | HH-RLHF (test) | Harm Score | 1.02 | 21 |
| LLM Alignment | HH-RLHF (test) | Win Rate | 80.3 | 21 |
| LLM Alignment | HH-RLHF 300 prompts | Win/Tie Rate vs Vanilla (GPT-4o) | 69.8 | 16 |
| RLHF | HH-RLHF | Human Win Rate | 74 | 16 |
| Best-of-N Alignment | HH-RLHF (test) | Percent of batches with BWR > 0.50 | 98 | 12 |
| Alignment | HH-RLHF | Estimated Score (EST) | 154 | 12 |
| Best-of-N Alignment | HH-RLHF | BWR | 53 | 12 |
| Reward model verification | HH-RLHF | Win Rate | 47.3 | 12 |
| Harmlessness evaluation | HH-RLHF harmless (test) | Win Rate | 83.33 | 12 |
| Certified Poisoning Stability | HH-RLHF | FTS@1 | 100 | 9 |
| Dialogue generation | full-hh-rlhf (test) | Win Rate (Beaver-7b-v3.0-reward) | 79.3 | 8 |
| Helpfulness evaluation | HH-RLHF helpful (test) | Helpfulness Fraction | 77 | 7 |
| Pairwise preference comparison | HH-RLHF held-out (test) | Win Rate | 53.02 | 6 |
| Validity Certification | HH-RLHF (test) | FTV@k=1 | 100 | 6 |
| Constitutional AI Alignment | HH-RLHF (test) | Likert Score Ranking | 4.596 | 6 |
| Controllable multi-objective generation | HH-RLHF Helpful vs Harmless (test) | Hypervolume | 1.24 | 6 |
| HH-RLHF | HH-RLHF | Hypervolume | 10.435 | 5 |
| Model Alignment | HH-RLHF D3 (test) | Harmlessness BLEU | 32.77 | 5 |
| Model Alignment | HH-RLHF D2 (test) | Harmlessness BLEU | 20.13 | 5 |
| Model Alignment | HH-RLHF 0-shot (test) | Harmlessness BLEU | 62.68 | 5 |