A Finite-Calibration Regime Map for LLM Judge Panels

About

We study when LLM judge panels should be calibrated with low-dimensional stackers versus joint output tables under finite human-label budgets. Low-dimensional stackers have small estimation cost but miss interactions, whereas joint-table calibrators can represent interactions but pay for cell counts and unseen patterns. We cast this tradeoff as a finite-calibration regime map and instantiate it as Finite-Calibration Panel Selection, a deployable validation selector over judge path, prefix size, and aggregator family with table and parametric estimation diagnostics. On RewardBench, LLMBar, SummEval, and Arena100K with a seven-judge pool including DeepSeek V4 Flash, scalar/reliability aggregation wins 16 of 20 real dataset-budget cells, indicating that current judge outputs are often additive or redundant. Controlled calibration-growth data show the complementary regime: additive labels remain scalar-favored, whereas a six-way interaction selects a larger joint table and its test MSE drops from 0.224 to 0.061 once unseen mass vanishes. Thus the practical question is not "how many judges?" but whether the next judge's information is estimable under the available human labels.
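The tradeoff described above can be illustrated with a small toy sketch. It is not the authors' selector: the binary judge votes, the two synthetic label rules, and every name below (`fit_cells`, `evaluate`, `scalar_key`, `joint_key`, `small_cal`) are assumptions made for illustration. A scalar calibrator keyed on the vote count has K + 1 cells, a joint table keyed on the full vote pattern has 2^K cells, and patterns absent from the calibration set fall back to the global mean.

```python
from collections import defaultdict
from itertools import product

def fit_cells(rows, key):
    """Mean human label per cell, plus the global mean as a fallback."""
    sums, counts = defaultdict(float), defaultdict(int)
    for votes, y in rows:
        sums[key(votes)] += y
        counts[key(votes)] += 1
    global_mean = sum(y for _, y in rows) / len(rows)
    return {c: sums[c] / counts[c] for c in sums}, global_mean

def evaluate(rows_cal, rows_test, key):
    """Return (test MSE, unseen mass) for a cell-mean calibrator."""
    cells, fallback = fit_cells(rows_cal, key)
    sq_err, unseen = 0.0, 0
    for votes, y in rows_test:
        cell = key(votes)
        if cell not in cells:
            unseen += 1  # test pattern never observed during calibration
        sq_err += (cells.get(cell, fallback) - y) ** 2
    return sq_err / len(rows_test), unseen / len(rows_test)

scalar_key = sum    # low-dimensional: cell = number of positive votes
joint_key = tuple   # joint table: cell = full vote pattern

def additive_label(votes):
    # Hypothetical additive regime: label is the mean vote.
    return sum(votes) / len(votes)

def interaction_label(votes):
    # Hypothetical interaction regime: label is XOR of the first two judges.
    return float(votes[0] ^ votes[1])

def with_labels(patterns, label_fn):
    return [(v, label_fn(v)) for v in patterns]

K = 4
patterns = list(product((0, 1), repeat=K))  # all 2**K vote patterns
# Small label budget: one pattern per vote count, so most patterns are unseen.
small_cal = [(1,) * i + (0,) * (K - i) for i in range(K + 1)]

# Additive labels, small budget.
add_scalar = evaluate(with_labels(small_cal, additive_label),
                      with_labels(patterns, additive_label), scalar_key)
add_joint = evaluate(with_labels(small_cal, additive_label),
                     with_labels(patterns, additive_label), joint_key)

# Interaction labels, budget large enough to cover every pattern.
int_scalar = evaluate(with_labels(patterns, interaction_label),
                      with_labels(patterns, interaction_label), scalar_key)
int_joint = evaluate(with_labels(patterns, interaction_label),
                     with_labels(patterns, interaction_label), joint_key)

for name, (mse, unseen) in [("additive/scalar", add_scalar),
                            ("additive/joint", add_joint),
                            ("interaction/scalar", int_scalar),
                            ("interaction/joint", int_joint)]:
    print(f"{name:20s} mse={mse:.4f} unseen_mass={unseen:.4f}")
```

In this toy setting the scalar calibrator is exact on additive labels even with five calibration patterns, while the joint table pays for its unseen cells. With interaction labels and full pattern coverage the ordering reverses, which mirrors the two regimes the abstract contrasts.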

Bin Zhu, Yanghui Rao • 2026

Related benchmarks

Task	Dataset	Result
Reward Modeling	RewardBench (test)	--	25
LLM-as-a-Judge Calibration	RewardBench (test)	Test Risk (MSE) 0.024	7
LLM-as-a-Judge Calibration	LLMBar (test)	Test Risk (MSE) 0.203	7
LLM-as-a-Judge Calibration	SUMMEVAL (test)	Test Risk (MSE) 0.044	7
LLM-as-a-Judge Calibration	Arena100K (test)	Test Risk (MSE) 0.231	7
Reward Modeling	SUMMEVAL (test)	MSE (Table) 0.0444	6
Reward Modeling	LLMBar (test)	Test MSE (Table) 0.2039	5
Reward Modeling	Arena100K (test)	Table MSE 0.2311	4
