SOTA LLM-as-a-Judge benchmarks and papers with code | Wizwand
LLM-as-a-Judge
Benchmarks
| Dataset Name | SOTA Method | Metric | Value | Results | Last Updated |
| --- | --- | --- | --- | --- | --- |
| MTbench (test) | DI | StdDev | 2.24 | 45 | 1mo ago |
| PreferenceBench | CalibraEval | Rstd | 0.69 | 36 | 1mo ago |
| RewardBench 1.0 (test) | CC | Rstd | 0.54 | 36 | 1mo ago |
| JudgeBench | DeepSeek-V3 | Accuracy | 84.19 | 29 | 25d ago |
| PreferenceBench | PA-GRPO | Accuracy | 90.2 | 21 | 25d ago |
| MT-Bench | PA-GRPO | Accuracy | 81.4 | 21 | 25d ago |
| ARENA | EpiPersona-A | Accuracy | 66.07 | 20 | 18d ago |
| PRISM | EpiPersona-A | Accuracy | 59.38 | 20 | 18d ago |
| PRISM (test) | SynthesizeMe | Accuracy | 58.9 | 14 | 1mo ago |
| Chatbot Arena (test) | Gemini-2.5-Pro | Accuracy | 68.13 | 14 | 1mo ago |
| FairJudge Benchmark 1K (test) | FairJudge-8B | Agreement | 71.5 | 13 | 1mo ago |
| JudgeLM (test) | Qwen2.5-72B | Agreement | 79.59 | 13 | 1mo ago |
| PandaLM Human Annotations (test) | FairJudge-8B | Agreement | 0.7683 | 13 | 1mo ago |
| Preference Bench (test) | CalibraEval | Std Dev | 2.82 | 9 | 1mo ago |
| RewardBench (test) | CalibraEval | Std Dev (Reward) | 2.72 | 9 | 1mo ago |
| JudgeBench (Merged GPT Claude) | qwen3.5-35b | Direct Baseline Score | 87.38 | 8 | 11d ago |
| RewardBench | Qwen3-Next-80B-A3B-Thinking | Accuracy | 92.9 | 8 | 1mo ago |
| KD-DTI (test) | GPT-4o-Mini | EM Change | 53.41 | 8 | 1mo ago |
| DDI (test) | GPT-4o-Mini | EM (Δ) | 59.03 | 8 | 1mo ago |
| BC5CDR (test) | GPT-4o-Mini | EM | 48.35 | 8 | 1mo ago |
| BigGen-Bench (test) | LLaDA (FS) | Pearson Correlation | 0.312 | 4 | 11d ago |
Showing 21 of 21 rows