LLM-as-a-Judge Calibration on Arena100K (test)
[Chart: Test Risk (MSE) by date (May 31, 2026). Best: 0.227 (Test oracle envelope)]
Updated 1d ago
Evaluation Results
Method                          Details                     Date      Test Risk (MSE)
Test oracle envelope            Calibration Budget=Lar...   2026.05   0.227
Information-first + val. K      Calibration Budget=Lar...   2026.05   0.231
Complexity-penalized + val. K   Calibration Budget=Lar...   2026.05   0.232
Path-family envelope            Calibration Budget=Lar...   2026.05   0.233
Best single judge (val.)        Calibration Budget=Lar...   2026.05   0.234
Random path + val. K            Calibration Budget=Lar...   2026.05   0.24
All seven judges                Calibration Budget=Lar...   2026.05   0.245