| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|---|
| Worst-Case Estimation Error | HelpSteer2 (test) | WCE 18.7 | 48 |
| Reward Modeling | HelpSteer 3 | Accuracy 83.15 | 39 |
| Controllable Generation | HelpSteer2 | Diversity 0.987 | 36 |
| Pair-wise comparison | HelpSteer2 | Accuracy 72.3 | 16 |
| Attribute-controlled Text Generation | HelpSteer2 Relative Positive Representative Target | Diversity 0.946 | 12 |
| NLG Evaluation | HelpSteer2 | Spearman Correlation 0.65 | 10 |
| Human-Metric Correlation | HelpSteer2 (In-Distribution) | Kendall's Tau 0.342 | 9 |
| Helpful Response Evaluation | HelpSteer-2 | CV (Helpfulness) 0.03 | 7 |
| Computational cost comparison | HelpSteer2 (test) | GPU Hours 0.02 | 6 |
| Preference Optimization | HelpSteer2 (test) | Avg Pref Score vs QWEN2.5-0.5B 0.8 | 5 |
| Model Ranking Prediction | Helpsteer 13B+ Models Holdout (test) | Acc_pair (RM1 Helpful) 74.1 | 4 |
| Model Ranking Prediction | Helpsteer 30B+ Models Holdout (test) | Pairwise Accuracy (RM1) 76.5 | 4 |
| Model Ranking Prediction | Helpsteer 70B+ Models Holdout (test) | Pairwise Acc (RM1) 77.8 | 4 |
| Pairwise Preference Ranking | Helpsteer 10% holdout (test) | Pairwise Acc (RM1-Helpful) 85.5 | 4 |
| Pairwise Preference Ranking | Helpsteer 5% holdout (test) | Pairwise Accuracy (RM1-Helpful) 84.9 | 4 |
| Pairwise Preference Ranking | Helpsteer 2% holdout (test) | Pairwise Acc (RM1) 86.6 | 4 |
| Pareto frontier approximation | HelpSteer2 | Hypervolume (HV) 12.66 | 3 |
| Controllable Model Distillation | HelpSteer2 | HV 16.81 | 3 |