| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| RLHF Alignment | UltraFeedback In-domain v1 (test) | Win Rate: 81 | 46 |
| MT-Bench | UltraFeedback | MT-Bench Score: 8.1 | 42 |
| AlpacaEval 2.0 | UltraFeedback | LC: 30 | 42 |
| Controllable Generation | Code-UltraFeedback | Diversity: 90.7 | 36 |
| Generative Performance | UltraFeedback 61.1k (test) | Win Rate: 69.8 | 30 |
| Discriminative Performance | UltraFeedback 61.1k (test) | Accuracy: 73.05 | 30 |
| Preference Alignment | UltraFeedback 40% flipping ratio | Accuracy: 78.87 | 12 |
| Preference Alignment | UltraFeedback 20% flipping ratio | Accuracy: 78.8 | 12 |
| Alignment | UltraFeedback (test) | IF Score: 68.5 | 11 |
| LLM Alignment | UltraFeedback (in-domain) | Win Rate (KL, alpha=1): 80.6 | 8 |
| Preference Prediction | UltraFeedback 500 held-out users (test) | Test Accuracy: 70.53 | 7 |
| Human Evaluation | UltraFeedback 50 sampled questions | Win Rate (Expert 1): 62 | 5 |
| LLM Alignment | UltraFeedback 2023 (test) | Win Rate: 55 | 4 |
| Format Debiasing | UltraFeedback Format-Biased (test) | Win Rate (Bold): 89 | 4 |
| Model Ranking Prediction | UltraFeedback 13B+ Models Holdout (test) | Pairwise Accuracy (RM1_Honest): 74.8 | 4 |
| Model Ranking Prediction | UltraFeedback 30B+ Models Holdout (test) | Pairwise Accuracy (RM1_Honest): 77.3 | 4 |
| Model Ranking Prediction | UltraFeedback 70B+ Models Holdout (test) | Pairwise Accuracy (RM1_Honest): 77.4 | 4 |
| Pairwise Preference Ranking | UltraFeedback 10% holdout (test) | Pairwise Accuracy (RM1-Honest): 86.3 | 4 |
| Pairwise Preference Ranking | UltraFeedback 5% holdout (test) | Pairwise Accuracy (RM1-Honest): 87 | 4 |
| Pairwise Preference Ranking | UltraFeedback 2% holdout (test) | Pairwise Accuracy (RM1-Honest): 89.1 | 4 |