Share your thoughts, 1 month free Claude Pro on usSee more
WorkDL logo mark

HelpSteer

Benchmarks

Task NameDataset NameSOTA ResultTrend
Reward ModelingHelpSteer (test)
MAE0.2428
48
Worst-Case Estimation ErrorHelpSteer2 (test)
WCE18.7
48
Reward ModelingHelpSteer 3
Accuracy83.15
39
Controllable GenerationHelpSteer2
Diversity0.987
36
LLM AlignmentHelpSteer (test)
AlpacaEval 2 WR8.34
27
Pair-wise comparisonHelpSteer2
Accuracy72.3
16
Prompt Optimization EvaluationHelpSteer2
Helpfulness0.5072
14
Prompt Optimization EvaluationHelpSteer 1
Helpfulness51.83
14
Attribute-controlled Text GenerationHelpSteer2 Relative Positive Representative Target
Diversity0.946
12
NLG EvaluationHelpSteer2
Spearman Correlation0.65
10
Human-Metric CorrelationHelpSteer2 (In-Distribution)
Kendall's Tau0.342
9
Preference ModelingHelpSteer2 held-out (test)
Preference Accuracy68.4
7
Helpful Response EvaluationHelpSteer-2
CV (Helpfulness)0.03
7
Computational cost comparisonHelpSteer2 (test)
GPU Hours0.02
6
Preference OptimizationHelpSteer2 (test)
Avg Pref Score vs QWEN2.5-0.5B0.8
5
Model Ranking PredictionHelpsteer 13B+ Models Holdout (test)
Acc_pair (RM1 Helpful)74.1
4
Model Ranking PredictionHelpsteer 30B+ Models Holdout (test)
Pairwise Accuracy (RM1)76.5
4
Model Ranking PredictionHelpsteer 70B+ Models Holdout (test)
Pairwise Acc (RM1)77.8
4
Pairwise Preference RankingHelpsteer 10% holdout (test)
Pairwise Acc (RM1-Helpful)85.5
4
Pairwise Preference RankingHelpsteer 5% holdout (test)
Pairwise Accuracy (RM1-Helpful)84.9
4
Pairwise Preference RankingHelpsteer 2% holdout (test)
Pairwise Acc (RM1)86.6
4
Pareto frontier approximationHelpSteer2
Hypervolume (HV)12.66
3
Controllable Model DistillationHelpSteer2
HV16.81
3
Reasoning Quality EvaluationHelpSteer1 sampled (train)
Usefulness Score3.89
2
Showing 24 of 24 rows