LLM-as-a-Judge Calibration on RewardBench (test)
[Chart: Test Risk (MSE) by date. Leading entry: Test oracle envelope, 0.022 (May 31, 2026).]
Evaluation Results
Method                          Details                     Date      Test Risk (MSE)
Test oracle envelope            Calibration Budget=Lar...   2026.05   0.022
Information-first + val. K      Calibration Budget=Lar...   2026.05   0.024
Complexity-penalized + val. K   Calibration Budget=Lar...   2026.05   0.025
Path-family envelope            Calibration Budget=Lar...   2026.05   0.025
Random path + val. K            Calibration Budget=Lar...   2026.05   0.027
All seven judges                Calibration Budget=Lar...   2026.05   0.028
Best single judge (val.)        Calibration Budget=Lar...   2026.05   0.032