
HelpSteer

Benchmarks

| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Worst-Case Estimation Error | HelpSteer2 (test) | WCE: 18.7 | 48 |
| Reward Modeling | HelpSteer 3 | Accuracy: 83.15 | 39 |
| Controllable Generation | HelpSteer2 | Diversity: 0.987 | 36 |
| Pair-wise comparison | HelpSteer2 | Accuracy: 72.3 | 16 |
| Attribute-controlled Text Generation | HelpSteer2 Relative Positive Representative Target | Diversity: 0.946 | 12 |
| NLG Evaluation | HelpSteer2 | Spearman Correlation: 0.65 | 10 |
| Human-Metric Correlation | HelpSteer2 (In-Distribution) | Kendall's Tau: 0.342 | 9 |
| Helpful Response Evaluation | HelpSteer-2 | CV (Helpfulness): 0.03 | 7 |
| Computational cost comparison | HelpSteer2 (test) | GPU Hours: 0.02 | 6 |
| Preference Optimization | HelpSteer2 (test) | Avg Pref Score vs QWEN2.5-0.5B: 0.8 | 5 |
| Model Ranking Prediction | Helpsteer 13B+ Models Holdout (test) | Acc_pair (RM1 Helpful): 74.1 | 4 |
| Model Ranking Prediction | Helpsteer 30B+ Models Holdout (test) | Pairwise Accuracy (RM1): 76.5 | 4 |
| Model Ranking Prediction | Helpsteer 70B+ Models Holdout (test) | Pairwise Acc (RM1): 77.8 | 4 |
| Pairwise Preference Ranking | Helpsteer 10% holdout (test) | Pairwise Acc (RM1-Helpful): 85.5 | 4 |
| Pairwise Preference Ranking | Helpsteer 5% holdout (test) | Pairwise Accuracy (RM1-Helpful): 84.9 | 4 |
| Pairwise Preference Ranking | Helpsteer 2% holdout (test) | Pairwise Acc (RM1): 86.6 | 4 |
| Pareto frontier approximation | HelpSteer2 | Hypervolume (HV): 12.66 | 3 |
| Controllable Model Distillation | HelpSteer2 | HV: 16.81 | 3 |