JudgeBench

Benchmarks

| Task Name | Dataset Name | SOTA Result | Trend |
| --- | --- | --- | --- |
| Reward Modeling | JudgeBench | Accuracy 93.3 | 105 |
| Uncertainty Estimation | JudgeBench (test) | AUROC 71.53 | 77 |
| Reward Modeling | JudgeBench (test) | Overall 82 | 40 |
| LLM-as-a-Judge | JudgeBench | Accuracy 84.19 | 29 |
| Uncertainty Calibration | JudgeBench | Kuiper 0.037 | 24 |
| LLM-as-a-Judge Evaluation | JudgeBench (test) | Score 83.4 | 22 |
| LLM Evaluation | JudgeBench (test) | Knowledge 79.9 | 16 |
| Pair-wise comparison | JudgeBench | Accuracy 75.7 | 16 |
| Reward Modeling | JudgeBench Knowledge | Accuracy 74.4 | 16 |
| Reward Modeling | JudgeBench | Knowledge 62.3 | 13 |
| Preference Prediction | JudgeBench | Positional Consistent Accuracy 63.9 | 10 |
| Reward Modeling | JudgeBench | Positional Consistency Score 56.3 | 8 |
| LLM-as-a-Judge | JudgeBench (Merged GPT Claude) | Direct Baseline Score 87.38 | 8 |
| Model Evaluation | JudgeBench (test) | Kuiper 5.63 | 8 |