
RewardBench

Benchmarks

| Task Name | Dataset Name | Metric | SOTA Result | Trend |
| --- | --- | --- | --- | --- |
| Reward Modeling | RewardBench | Accuracy | 97.8 | 166 |
| Reward Modeling | RewardBench | Chat Score | 99.4 | 146 |
| Reward Modeling | RewardBench v1.0 (test) | Average Score | 0.978 | 89 |
| Reward Modeling | RewardBench Focus 2 | Accuracy | 90.3 | 82 |
| Reward Modeling | RewardBench v2 | Accuracy | 92.1 | 72 |
| Reward Modeling | RewardBench Precise IF 2 | Accuracy | 57.5 | 70 |
| Reward Modeling | RewardBench Average 2 | Accuracy | 39.7 | 52 |
| Reward Modeling | RewardBench Math 2 | Accuracy | 35.7 | 52 |
| Reward Modeling | RewardBench v2 (test) | Average Score | 86.5 | 42 |
| LLM-as-a-Judge | RewardBench 1.0 (test) | Rstd | 0.54 | 36 |
| LLM-as-a-Judge Evaluation Consistency | RewardBench | Kappa | 83.25 | 36 |
| Multimodal Reward Modeling | RewardBench Multimodal | Safety Score | 99.6 | 31 |
| Reward Modeling | RewardBench 2 | Accuracy | 89.5 | 30 |
| Multimodal Reward Modeling | Multimodal RewardBench | Accuracy | 88.79 | 30 |
| Pair-wise Comparison | RewardBench | Accuracy | 93.7 | 29 |
| Reward Modeling | RewardBench v1 | Accuracy | 95.5 | 28 |
| Reward Modeling | RewardBench (test) | RWBench | 0.933 | 25 |
| Uncertainty Calibration | RewardBench | Kuiper | 0.009 | 24 |
| Reward Modeling Evaluation | RewardBench2 (test) | Accuracy | 82.9 | 20 |
| Reward Modeling | RewardBench 2 | L-Acc | 93.4 | 20 |
| Reward Modeling | RewardBench unified-feedback (test) | Average Score | 84 | 20 |
| Multi-modal Preference Evaluation | MM-RewardBench | Accuracy | 72.9 | 19 |
| Reward Modeling | RewardBench Chat | Accuracy | 96.4 | 18 |
| Multimodal Reward Modeling | MM-RLHF-RewardBench | Pairwise Accuracy | 92.4 | 18 |
| Reward Modeling | RewardBench 1k | Positional Consistency | 84.9 | 16 |
Showing 25 of 47 rows