LLM-as-a-Judge Calibration on SummEval (test)
[Chart: Test Risk (MSE) by date. Leading entry: Test oracle envelope, 0.043, May 31, 2026.]
Evaluation Results
Method                          Configuration              Date     Test Risk (MSE)
Test oracle envelope            Calibration Budget=Lar...  2026.05  0.043
Information-first + val. K      Calibration Budget=Lar...  2026.05  0.044
Best single judge (val.)        Calibration Budget=Lar...  2026.05  0.045
Path-family envelope            Calibration Budget=Lar...  2026.05  0.045
Complexity-penalized + val. K   Calibration Budget=Lar...  2026.05  0.046
Random path + val. K            Calibration Budget=Lar...  2026.05  0.048
All seven judges                Calibration Budget=Lar...  2026.05  0.059