JudgeBench

Benchmarks

| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Uncertainty Estimation | JudgeBench (test) | AUROC 71.53 | 77 |
| Reward Modeling | JudgeBench | Accuracy 93.3 | 45 |
| Reward Modeling | JudgeBench (test) | Overall 82 | 40 |
| Uncertainty Calibration | JudgeBench | Kuiper 0.037 | 24 |
| Pair-wise comparison | JudgeBench | Accuracy 75.7 | 16 |
| Reward Modeling | JudgeBench Knowledge | Accuracy 74.4 | 16 |
| Reward Modeling | JudgeBench | Knowledge 62.3 | 13 |
| Model Evaluation | JudgeBench (test) | Kuiper 5.63 | 8 |
| LLM-as-a-Judge | JudgeBench | Accuracy 84.19 | 8 |