
A Finite-Calibration Regime Map for LLM Judge Panels

About

We study when LLM judge panels should be calibrated with low-dimensional stackers versus joint output tables under finite human-label budgets. Low-dimensional stackers have small estimation cost but miss interactions, whereas joint-table calibrators can represent interactions but pay for cell counts and unseen patterns. We cast this tradeoff as a finite-calibration regime map and instantiate it as Finite-Calibration Panel Selection, a deployable validation selector over judge path, prefix size, and aggregator family with table and parametric estimation diagnostics. On RewardBench, LLMBar, SummEval, and Arena100K with a seven-judge pool including DeepSeek V4 Flash, scalar/reliability aggregation wins 16 of 20 real dataset-budget cells, indicating that current judge outputs are often additive or redundant. Controlled calibration-growth data show the complementary regime: additive labels remain scalar-favored, whereas a six-way interaction selects a larger joint table and its test MSE drops from 0.224 to 0.061 once unseen mass vanishes. Thus the practical question is not "how many judges?" but whether the next judge's information is estimable under the available human labels.
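The stacker-versus-table tradeoff described above can be illustrated with a small simulation. This is a minimal sketch under stated assumptions, not the paper's Finite-Calibration Panel Selection procedure: judges emit independent binary votes, a vote-count lookup stands in for a low-dimensional stacker, the full vote pattern stands in for a joint output table, and unseen keys fall back to the calibration-set mean. All function names, the label rules, and the budgets are hypothetical and chosen only for illustration.

```python
import random

def make_data(n, n_judges, label_fn, rng):
    """Draw n items: a tuple of binary judge votes and a label from label_fn."""
    X = [tuple(rng.randint(0, 1) for _ in range(n_judges)) for _ in range(n)]
    return X, [label_fn(x) for x in X]

def fit_cells(X, y, key):
    """Lookup calibrator: mean label per key(x), plus a prior for unseen keys."""
    sums, counts = {}, {}
    for x, t in zip(X, y):
        k = key(x)
        sums[k] = sums.get(k, 0.0) + t
        counts[k] = counts.get(k, 0) + 1
    prior = sum(y) / len(y)
    return {k: sums[k] / counts[k] for k in sums}, prior

def evaluate(model, key, X, y):
    """Return (test MSE, unseen mass), where unseen mass is the share of
    test items whose key never appeared in the calibration set."""
    means, prior = model
    sq_err, unseen = 0.0, 0
    for x, t in zip(X, y):
        k = key(x)
        if k not in means:
            unseen += 1
        sq_err += (means.get(k, prior) - t) ** 2
    return sq_err / len(y), unseen / len(y)

def compare(label_fn, budget, n_judges=6, n_test=2000, seed=0):
    """Fit both calibrators on `budget` labeled items and score them on a test set."""
    rng = random.Random(seed)
    X_cal, y_cal = make_data(budget, n_judges, label_fn, rng)
    X_te, y_te = make_data(n_test, n_judges, label_fn, rng)
    results = {}
    # "scalar": key is the vote count (n_judges + 1 cells).
    # "table":  key is the full vote pattern (2 ** n_judges cells).
    for name, key in (("scalar", sum), ("table", tuple)):
        results[name] = evaluate(fit_cells(X_cal, y_cal, key), key, X_te, y_te)
    return results

def additive(x):
    """Hypothetical additive rule: the label depends only on the vote count."""
    return int(sum(x) >= 3)

def interaction(x):
    """Hypothetical interaction rule: XOR of the first two judges' votes."""
    return x[0] ^ x[1]

for rule in (additive, interaction):
    for budget in (30, 2000):
        for name, (mse, unseen) in compare(rule, budget).items():
            print(f"{rule.__name__:11s} budget={budget:4d} {name:6s} "
                  f"mse={mse:.3f} unseen={unseen:.3f}")
```

In this toy setup the vote-count calibrator has only seven cells to estimate, so it needs few labels but cannot express the XOR rule, while the 64-cell table can represent that rule but depends on the calibration budget being large enough to cover the patterns that appear at test time.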

Bin Zhu, Yanghui Rao (2026)

Related benchmarks

Task                        Dataset             Result                    Rank
Reward Modeling             RewardBench (test)  --                        25
LLM-as-a-Judge Calibration  RewardBench (test)  Test Risk (MSE): 0.024    7
LLM-as-a-Judge Calibration  LLMBar (test)       Test Risk (MSE): 0.203    7
LLM-as-a-Judge Calibration  SUMMEVAL (test)     Test Risk (MSE): 0.044    7
LLM-as-a-Judge Calibration  Arena100K (test)    Test Risk (MSE): 0.231    7
Reward Modeling             SUMMEVAL (test)     MSE (Table): 0.0444       6
Reward Modeling             LLMBar (test)       Test MSE (Table): 0.2039  5
Reward Modeling             Arena100K (test)    Table MSE: 0.2311         4
