| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| RLHF Alignment | UltraFeedback In-domain v1 (test) | Win Rate: 81 | 46 |
| MT-Bench | UltraFeedback | MT-Bench Score: 8.1 | 42 |
| AlpacaEval 2.0 | UltraFeedback | LC: 30 | 42 |
| Controllable Generation | Code-UltraFeedback | Diversity: 90.7 | 36 |
| Generative Performance | Ultrafeedback 61.1k (test) | Win Rate: 69.8 | 30 |
| Discriminative Performance | Ultrafeedback 61.1k (test) | Accuracy: 73.05 | 30 |
| Reward Modeling | UltraFeedback (test) | MAE: 0.1679 | 21 |
| LLM Alignment | UltraFeedback (test) | AlpacaEval 2 Win Rate (WR): 21 | 18 |
| Preference Alignment | Ultrafeedback 40% flipping ratio | Accuracy: 78.87 | 12 |
| Preference Alignment | Ultrafeedback 20% flipping ratio | Accuracy: 78.8 | 12 |
| Preference Alignment | UltraFeedback (test) | Accuracy: 74.18 | 11 |
| Direct Preference Optimization | UltraFeedback | Accuracy: 69.92 | 11 |
| Alignment | UltraFeedback (test) | IF Score: 68.5 | 11 |
| Correctness Assessment | UltraFeedback Property Constraints Satisfaction (test) | Worst-case Size Distortion (Helpfulness & Instruction-following): 0.07 | 9 |
| Preference Optimization Evaluation | UltraFeedback (test_prefs) | Pair Accuracy: 57.65 | 8 |
| Reward Modeling | UltraFeedback Cleaned | Total Score: 92.36 | 8 |
| LLM Alignment | UltraFeedback (in-domain) | Win Rate (KL, alpha=1): 80.6 | 8 |
| Preference Prediction | UltraFeedback 500 held-out users (test) | Test Accuracy: 70.53 | 7 |
| scoring | UltraFeedback | MAE: 0.623 | 5 |
| Human Evaluation | UltraFeedback 50 sampled questions | Win Rate (Expert 1): 62 | 5 |
| Instruction Following | ultrafeedback-prompt (test) | Win Rate: 52 | 4 |
| Instruction-following | UltraFeedback | Win Rate: 80.8 | 4 |
| LLM Alignment | UltraFeedback 2023 (test) | Win-rate: 55 | 4 |
| Format Debiasing | UltraFeedback Format-Biased (test) | Win-Rate (Bold): 89 | 4 |
| Model Ranking Prediction | UltraFeedback 13B+ Models Holdout (test) | Pairwise Accuracy (RM1_Honest): 74.8 | 4 |