When LLM Judge Scores Look Good but Best-of-N Decisions Fail

About

Large language models are often used as judges to score candidate responses, then validated with a single global metric such as correlation with reference labels. This can be misleading when the real deployment task is best-of-n selection within a prompt. In a 5,000-prompt best-of-4 benchmark from Chatbot Arena, a judge with moderate global correlation (r = 0.47) captures only 21.0% of the improvement that perfect selection would achieve over random choice. The gap arises because global agreement is driven largely by prompt-level baseline effects, while selection depends on within-prompt ranking: within-prompt correlation is only r_within = 0.27, and coarse pointwise scoring creates ties in 67% of pairwise comparisons. In a matched-pair best-of-2 audit, explicit pairwise judging recovers much of this lost signal, raising recovery from 21.1% to 61.2%. For judge-based selection, the relevant audit should report within-prompt signal, tie rates, and recovery/top-1 accuracy, not global agreement alone.
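The audit quantities named in the abstract can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation, and it rests on several assumptions: the function names `pearson` and `audit` and the input layout are hypothetical; within-prompt correlation is computed as the pooled correlation of prompt-centered scores; judge ties are broken uniformly at random (taken in expectation); and recovery is defined as (judge-selected quality minus random-choice quality) divided by (oracle quality minus random-choice quality), following the abstract's wording.

```python
from itertools import combinations
from math import sqrt
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 if either input has no variance."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)


def audit(prompts):
    """prompts: list of (judge_scores, true_scores) pairs, one pair per prompt.

    Hypothetical layout: both lists are aligned by candidate response.
    """
    all_j, all_t, cen_j, cen_t = [], [], [], []
    ties = pairs = 0
    picked = oracle = random_ = 0.0
    for judge, truth in prompts:
        mj, mt = mean(judge), mean(truth)
        all_j += judge
        all_t += truth
        # Assumption: "within-prompt" signal = correlation after removing
        # each prompt's mean from both judge and reference scores.
        cen_j += [j - mj for j in judge]
        cen_t += [t - mt for t in truth]
        # Count pairwise comparisons where the judge cannot separate candidates.
        for a, b in combinations(range(len(judge)), 2):
            pairs += 1
            ties += judge[a] == judge[b]
        # Assumption: ties for the top judge score are broken uniformly,
        # so the expected pick is the mean quality of the top-scored set.
        top = max(judge)
        picked += mean(t for j, t in zip(judge, truth) if j == top)
        oracle += max(truth)   # perfect selection
        random_ += mt          # uniform random selection
    gain = oracle - random_
    return {
        "r_global": pearson(all_j, all_t),
        "r_within": pearson(cen_j, cen_t),
        "tie_rate": ties / pairs,
        "recovery": (picked - random_) / gain if gain else 0.0,
    }


# Toy input: the judge separates the two prompts but is flat within each.
toy = [
    ([1, 1, 1, 1], [0.1, 0.2, 0.3, 0.4]),
    ([5, 5, 5, 5], [0.6, 0.7, 0.8, 0.9]),
]
print(audit(toy))
```

On this toy input the global correlation is high because the judge tracks the prompt-level baseline, while the within-prompt correlation and the recovery are both zero and every pairwise comparison is a tie. That is the failure pattern the abstract describes, shown here in its most extreme form.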

Eddie Landesberg • 2026

Related benchmarks

Task	Dataset	Result	Rank
Multi-judge evaluation	Shared 500-prompt sample	--	5
LLM-judge evaluation	LLM-to-LLM Evaluation Reference: GPT-5.2	--	2

