A Finite-Calibration Regime Map for LLM Judge Panels
About
We study when LLM judge panels should be calibrated with low-dimensional stackers versus joint output tables under finite human-label budgets. Low-dimensional stackers have small estimation cost but miss interactions, whereas joint-table calibrators can represent interactions but pay for cell counts and unseen patterns. We cast this tradeoff as a finite-calibration regime map and instantiate it as Finite-Calibration Panel Selection, a deployable validation selector over judge path, prefix size, and aggregator family with table and parametric estimation diagnostics. On RewardBench, LLMBar, SummEval, and Arena100K with a seven-judge pool including DeepSeek V4 Flash, scalar/reliability aggregation wins 16 of 20 real dataset-budget cells, indicating that current judge outputs are often additive or redundant. Controlled calibration-growth data show the complementary regime: additive labels remain scalar-favored, whereas a six-way interaction selects a larger joint table and its test MSE drops from 0.224 to 0.061 once unseen mass vanishes. Thus the practical question is not "how many judges?" but whether the next judge's information is estimable under the available human labels.
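The stacker-versus-table tradeoff described above can be illustrated with a toy sketch. This is not the paper's Finite-Calibration Panel Selection procedure. It assumes binary judge votes, a noise-free synthetic label equal to the XOR of the first two judges, a plain least-squares stacker as the low-dimensional family, and a fallback to the calibration mean for unseen vote patterns. All function names here are hypothetical.

```python
import numpy as np


def fit_linear_stacker(votes, labels):
    """Additive stacker: least-squares weights on an intercept plus each vote."""
    design = np.column_stack([np.ones(len(votes)), votes])
    coef, *_ = np.linalg.lstsq(design, labels, rcond=None)
    return coef


def predict_linear_stacker(coef, votes):
    design = np.column_stack([np.ones(len(votes)), votes])
    return design @ coef


def fit_joint_table(votes, labels):
    """Joint table: mean label for every vote pattern seen in calibration."""
    cells = {}
    for row, y in zip(votes.tolist(), labels.tolist()):
        cells.setdefault(tuple(row), []).append(y)
    table = {pattern: float(np.mean(ys)) for pattern, ys in cells.items()}
    # Assumption: unseen patterns fall back to the calibration mean label.
    return table, float(np.mean(labels))


def predict_joint_table(table, prior, votes):
    patterns = [tuple(row) for row in votes.tolist()]
    preds = np.array([table.get(p, prior) for p in patterns])
    # Fraction of test rows whose vote pattern never appeared in calibration.
    unseen_mass = float(np.mean([p not in table for p in patterns]))
    return preds, unseen_mass


def run_demo(n_cal, n_test=1000, n_judges=3, seed=0):
    """Compare both calibrators on a synthetic two-judge interaction."""
    rng = np.random.default_rng(seed)

    def sample(n):
        votes = rng.integers(0, 2, size=(n, n_judges))
        # Synthetic label: XOR of judges 0 and 1, which no additive model fits.
        labels = (votes[:, 0] ^ votes[:, 1]).astype(float)
        return votes, labels

    cal_votes, cal_labels = sample(n_cal)
    test_votes, test_labels = sample(n_test)

    coef = fit_linear_stacker(cal_votes, cal_labels)
    linear_preds = predict_linear_stacker(coef, test_votes)
    table, prior = fit_joint_table(cal_votes, cal_labels)
    joint_preds, unseen_mass = predict_joint_table(table, prior, test_votes)

    return {
        "linear_mse": float(np.mean((linear_preds - test_labels) ** 2)),
        "joint_mse": float(np.mean((joint_preds - test_labels) ** 2)),
        "unseen_mass": unseen_mass,
    }


if __name__ == "__main__":
    for budget in (4, 200):
        print(budget, run_demo(n_cal=budget))
```

With a tiny calibration budget the joint table leaves part of the test distribution on unseen patterns, while with a larger budget every cell is populated and the table captures the interaction that the additive stacker cannot. Real judge outputs are noisy and rarely this cleanly interactive, which is consistent with the abstract's finding that scalar aggregation wins most real dataset-budget cells.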
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Reward Modeling | RewardBench (test) | -- | -- | 25 |
| LLM-as-a-Judge Calibration | RewardBench (test) | Test Risk (MSE) | 0.024 | 7 |
| LLM-as-a-Judge Calibration | LLMBar (test) | Test Risk (MSE) | 0.203 | 7 |
| LLM-as-a-Judge Calibration | SummEval (test) | Test Risk (MSE) | 0.044 | 7 |
| LLM-as-a-Judge Calibration | Arena100K (test) | Test Risk (MSE) | 0.231 | 7 |
| Reward Modeling | SummEval (test) | MSE (Table) | 0.0444 | 6 |
| Reward Modeling | LLMBar (test) | Test MSE (Table) | 0.2039 | 5 |
| Reward Modeling | Arena100K (test) | Table MSE | 0.2311 | 4 |