JudgeBench

Benchmarks

| Task Name | Dataset Name | SOTA Result | Trend |
| --- | --- | --- | --- |
| Reward Modeling | JudgeBench | Accuracy: 93.3 | 117 |
| Uncertainty Estimation | JudgeBench (test) | AUROC: 71.53 | 77 |
| Reward Modeling | JudgeBench (test) | Overall: 82 | 40 |
| LLM-as-a-Judge | JudgeBench | Accuracy: 84.19 | 29 |
| Uncertainty Calibration | JudgeBench | Kuiper: 0.037 | 24 |
| LLM-as-a-Judge Evaluation | JudgeBench (test) | Score: 83.4 | 22 |
| Reward Modeling | JudgeBench | Knowledge: 74.6 | 22 |
| LLM Evaluation | JudgeBench (test) | Knowledge: 79.9 | 16 |
| Pair-wise comparison | JudgeBench | Accuracy: 75.7 | 16 |
| Reward Modeling | JudgeBench Knowledge | Accuracy: 74.4 | 16 |
| LLM Judging | JudgeBench response pairs generated by GPT-4o 1.0 | Knowledge: 68.18 | 11 |
| Preference Prediction | JudgeBench | Positional Consistent Accuracy: 63.9 | 10 |
| Discriminative Accuracy | JudgeBench | Knowledge Accuracy: 77.3 | 8 |
| Reward Modeling | JudgeBench | Positional Consistency Score: 56.3 | 8 |
| LLM-as-a-Judge | JudgeBench (Merged GPT Claude) | Direct Baseline Score: 87.38 | 8 |
| Model Evaluation | JudgeBench (test) | Kuiper: 5.63 | 8 |