
RewardBench

Benchmarks

| Task Name | Dataset Name | Metric | SOTA Result | Trend |
|---|---|---|---|---|
| Reward Modeling | RewardBench | Avg Score | 95.1 | 118 |
| Reward Modeling | RewardBench Focus 2 | Accuracy | 90.3 | 82 |
| Reward Modeling | RewardBench Precise IF 2 | Accuracy | 57.5 | 70 |
| Reward Modeling | RewardBench | Accuracy | 95.1 | 70 |
| Reward Modeling | RewardBench Average 2 | Accuracy | 39.7 | 52 |
| Reward Modeling | RewardBench Math 2 | Accuracy | 35.7 | 52 |
| LLM-as-a-Judge | RewardBench 1.0 (test) | Rstd | 0.54 | 36 |
| LLM-as-a-Judge Evaluation Consistency | RewardBench | Kappa | 83.25 | 36 |
| Reward Modeling | RewardBench v1.0 (test) | Chat Score | 0.9777 | 27 |
| Reward Modeling | RewardBench (test) | RWBench | 0.933 | 25 |
| Uncertainty Calibration | RewardBench | Kuiper | 0.009 | 24 |
| Reward Modeling | RewardBench 2 | L-Acc | 93.4 | 20 |
| Reward Modeling | RewardBench unified-feedback (test) | Average Score | 84 | 20 |
| Multi-modal Preference Evaluation | MM-RewardBench | Accuracy | 72.9 | 19 |
| Reward Modeling | RewardBench Chat | Accuracy | 96.4 | 18 |
| Multimodal Reward Modeling | Multimodal RewardBench | Accuracy | 85.4 | 17 |
| Pairwise LLM Judging | RewardBench | Coverage | 100 | 16 |
| Pair-wise Comparison | RewardBench | Accuracy | 93.7 | 16 |
| Reward Modeling | RewardBench v2 | Accuracy | 90.7 | 14 |
| Reward Modeling | RewardBench latest (full) | Average Score | 93.6 | 11 |
| Listwise Judging | RewardBench listwise 2 | IF Score | 58.1 | 10 |
| Reward Modeling | RewardBench 2 (test) | RWBench2 Score | 76.3 | 9 |
| Multimodal Reward Modeling | VL-RewardBench, Multimodal RewardBench, and MM-RLHF-RewardBench Aggregate | Accuracy | 82.44 | 9 |
| Multimodal Reward Modeling | MM-RLHF-RewardBench | Accuracy | 85.88 | 9 |
| LLM-as-a-Judge | RewardBench (test) | Std Dev (Reward) | 2.72 | 9 |
Showing 25 of 30 rows