| Task Name | Dataset Name | Metric | SOTA Result | Trend |
|---|---|---|---|---|
| RLHF Alignment | UltraFeedback In-domain v1 (test) | Win Rate | 81 | 46 |
| MT-Bench | UltraFeedback | MT-Bench Score | 8.1 | 42 |
| AlpacaEval 2.0 | UltraFeedback | LC | 30 | 42 |
| Reward Modeling | UltraFeedback (test) | MAE | 0.145 | 38 |
| Controllable Generation | Code-UltraFeedback | Diversity | 90.7 | 36 |
| Generative Performance | Ultrafeedback 61.1k (test) | Win Rate | 69.8 | 30 |
| Discriminative Performance | Ultrafeedback 61.1k (test) | Accuracy | 73.05 | 30 |
| Response Generation | UltraFeedback (val) | BERTScore | 88.1 | 24 |
| LLM Judgment | UltraFeedback | Accuracy | 68.75 | 23 |
| Multi-turn Conversation Evaluation | UltraFeedback | MT-Bench Score | 6.1 | 20 |
| Sequential Preference Optimization | UltraFeedback | Harmless Rate | 99.71 | 20 |
| Alignment | UltraFeedback (test) | Honesty Score | 63.72 | 20 |
| Best-of-N Reward Evaluation | UltraFeedback core250 | Reward Score | 24.323 | 18 |
| Reward Modeling | UltraFeedback core250 (held-out evaluation) | Delta (Δ) | 3.543 | 18 |
| LLM Alignment | UltraFeedback (test) | AlpacaEval 2 Win Rate (WR) | 21 | 18 |
| Reward Model Transfer | UltraFeedback (UF) | AOG | 7.93 | 16 |
| Preference Alignment | UltraFeedback | Win Rate | 81 | 16 |
| Instruction Following | UltraFeedback (core250) | Delta Preference Score (bo64) | 12.568 | 15 |
| Pairwise Judge Comparison | UltraFeedback core250 | Win Count (W) | 161 | 14 |
| Preference Evaluation | UltraFeedback core250 (test) | Win Rate | 80 | 12 |
| Preference Alignment | Ultrafeedback 40% flipping ratio | Accuracy | 78.87 | 12 |
| Preference Alignment | Ultrafeedback 20% flipping ratio | Accuracy | 78.8 | 12 |
| Preference Alignment | UltraFeedback (test) | Accuracy | 74.18 | 11 |
| Direct Preference Optimization | UltraFeedback | Accuracy | 69.92 | 11 |
| Multi-agent Reasoning | Ultrafeedback | Accuracy | 73.66 | 9 |