RewardBench

Benchmarks

| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Reward Modeling | RewardBench | Chat Score 99.4 | 216 |
| Reward Modeling | RewardBench | Accuracy 97.8 | 166 |
| Reward Modeling | RewardBench v1.0 (test) | Average Score 0.978 | 89 |
| Reward Modeling | RewardBench Focus 2 | Accuracy 90.3 | 82 |
| Reward Modeling | RewardBench v2 | Accuracy 92.1 | 72 |
| Reward Modeling | RewardBench Precise IF 2 | Accuracy 57.5 | 70 |
| Reward Modeling | RewardBench v2 (test) | Average Score 86.5 | 67 |
| Reward Modeling | RewardBench Average 2 | Accuracy 39.7 | 52 |
| Reward Modeling | RewardBench Math 2 | Accuracy 35.7 | 52 |
| Multimodal Reward Modeling | Multimodal RewardBench | Accuracy 60.7 | 50 |
| Multimodal Reward Modeling | RewardBench Multimodal | Safety Score 99.6 | 44 |
| MLLM-as-a-judge evaluation | VL RewardBench | Accuracy 80.75 | 42 |
| Reward Modeling | RewardBench Chat | Accuracy 96.4 | 42 |
| Reward Modeling | RewardBench 2 | Precise IF Score 71 | 41 |
| Reward Modeling | RewardBench (full) | Chat Score 99.2 | 41 |
| Reward Modeling | RewardBench | Accuracy 88.8 | 36 |
| LLM-as-a-Judge | RewardBench 1.0 (test) | Rstd 0.54 | 36 |
| LLM-as-a-Judge Evaluation Consistency | RewardBench | Kappa 83.25 | 36 |
| LLM-as-a-Judge | RewardBench | Accuracy 92.9 | 31 |
| Reward Modeling | RewardBench 2 | Accuracy 89.5 | 30 |
| Pair-wise comparison | RewardBench | Accuracy 93.7 | 29 |
| Reward Modeling | RewardBench v1 | Accuracy 95.5 | 28 |
| Reward Modeling | RewardBench (test) | RWBench 0.933 | 25 |
| Reward Modeling | RewardBench latest (test) | Accuracy 74.9 | 24 |
| Uncertainty Calibration | RewardBench | Kuiper 0.009 | 24 |
Showing 25 of 62 rows