| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Reward Modeling | HelpSteer (test) | MAE 0.077 | 65 |
| Reward Modeling | HelpSteer 3 | Accuracy 83.15 | 62 |
| Ordinal Regression | Helpsteer | L1 Error 0.72 | 48 |
| Worst-Case Estimation Error | HelpSteer2 (test) | WCE 18.7 | 48 |
| Controllable Generation | HelpSteer2 | Diversity 0.987 | 36 |
| Pointwise evaluation | HelpSteer2 | Spearman Correlation 0.464 | 28 |
| LLM Alignment | HelpSteer (test) | AlpacaEval 2 WR 8.34 | 27 |
| Attribute Steering | HelpSteer | Helpfulness 3.89 | 22 |
| Sequential Preference Optimization | HelpSteer2 | Harmless Rate 99.86 | 20 |
| Personalization | HelpSteer | Creative ArmoRM Score 0.51 | 18 |
| Reward Model Transfer | HelpSteer3 (H3) v1 (test) | AOG 8.96 | 16 |
| Pair-wise comparison | HelpSteer2 | Accuracy 72.3 | 16 |
| Test-Time Personalization | HelpSteer | Creative Win Rate 99.5 | 15 |
| Prompt Optimization Evaluation | HelpSteer2 | Helpfulness 0.5072 | 14 |
| Prompt Optimization Evaluation | HelpSteer 1 | Helpfulness 51.83 | 14 |
| Attribute-controlled Text Generation | HelpSteer2 Relative Positive Representative Target | Diversity 0.946 | 12 |
| Preference Alignment | HelpSteer3 | Score -5.89 | 10 |
| NLG Evaluation | HelpSteer2 | Spearman Correlation 0.65 | 10 |
| Human-Metric Correlation | HelpSteer2 (In-Distribution) | Kendall's Tau 0.342 | 9 |
| Preference Modeling | HelpSteer2 held-out (test) | Preference Accuracy 68.4 | 7 |
| Helpful Response Evaluation | HelpSteer-2 | CV (Helpfulness) 0.03 | 7 |
| Computational cost comparison | HelpSteer2 (test) | GPU Hours 0.02 | 6 |
| Preference Alignment | HelpSteer (test) | Pairwise Win Rate (excl. ties) 61.25 | 5 |
| Reward Modeling Evaluation | HelpSteer3 (test) | Score -5.89 | 5 |
| Preference Optimization | HelpSteer2 (test) | Avg Pref Score vs QWEN2.5-0.5B 0.8 | 5 |