
Distribution-Calibrated Inference-Time Compute for Thinking LLM-as-a-Judge

About

Thinking Large Language Models (LLMs) used as judges for pairwise preferences remain noisy at the single-sample level, and common aggregation rules (majority vote, soft self-consistency, or instruction-based self-aggregation) are inconsistent when ties are allowed. We study inference-time compute (ITC) for evaluators that generate n independent thinking-rating samples per item, and propose a principled, distribution-calibrated aggregation scheme. Our method models three-way preferences with a Bradley-Terry-Davidson formulation on rating counts, leveraging both polarity (margin among non-ties) and decisiveness (non-tie rate) to distinguish narrow margins from strong consensus. Across various evaluation benchmarks, our approach consistently reduces MAE and increases pairwise accuracy versus standard baselines, and when evaluated against human-consensus meta-labels, matches or exceeds individual human raters. These results show that carefully allocating ITC and aggregating with distribution-aware methods turns noisy individual model judgments into reliable ratings for evaluation.
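The aggregation idea above can be sketched for a single item pair: from n independent thinking-rating samples we get counts of "A wins", "B wins", and "tie", and a Bradley-Terry-Davidson model separates polarity (the margin among non-ties) from decisiveness (the non-tie rate). The sketch below is illustrative, not the paper's exact estimator; the Laplace smoothing `alpha` and the final `score` combination are assumptions.

```python
import math

def btd_aggregate(wins_a: int, wins_b: int, ties: int, alpha: float = 0.5):
    """Aggregate n three-way judge samples for one item pair under a
    Bradley-Terry-Davidson model (illustrative sketch, not the paper's
    exact estimator).  alpha is Laplace smoothing, an assumption here."""
    n = wins_a + wins_b + ties
    # Smoothed empirical probabilities of the three outcomes.
    pa = (wins_a + alpha) / (n + 3 * alpha)
    pb = (wins_b + alpha) / (n + 3 * alpha)
    pt = (ties + alpha) / (n + 3 * alpha)
    # For a single pair the BTD maximum likelihood is saturated: the
    # strength ratio matches the win ratio, and the Davidson tie
    # parameter nu absorbs the tie mass.
    polarity = math.log(pa / pb)       # margin among non-ties (log-odds)
    nu = pt / math.sqrt(pa * pb)       # Davidson tie parameter
    decisiveness = 1.0 - pt            # non-tie rate
    # Illustrative calibrated score: the signed margin among non-ties,
    # shrunk toward zero when many samples are ties.
    score = decisiveness * (pa - pb) / (pa + pb)
    return {"polarity": polarity, "nu": nu,
            "decisiveness": decisiveness, "score": score}
```

With counts (7, 1, 2) the score is strongly positive (clear consensus for A), while (4, 3, 3) yields a score near zero: the same majority direction, but a narrow margin with low decisiveness, which is exactly the distinction majority voting discards.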

Hamid Dadkhahi, Firas Trabelsi, Parker Riley, Juraj Juraska, Mehdi Mirzazadeh • 2025

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Reward Modeling | RewardBench Focus 2 | Accuracy | 72.5 | 82 |
| Reward Modeling | RewardBench Precise IF 2 | -- | -- | 70 |
| Reward Modeling Evaluation | RewardBench Factuality 2 | Pairwise Accuracy | 56.6 | 64 |
| Judge Alignment | WMT Zh-En | Pairwise Accuracy | 62.4 | 40 |
| Machine Translation Evaluation | WMT 2023 (test) | MAE (EN→DE) | 0.588 | 12 |
| Reward Model Evaluation | RewardBench 2 (test) | RB2 Factuality MAE | 0.451 | 12 |
| Reward Modeling Evaluation | RewardBench Math 2 | Pairwise Accuracy | 72.3 | 12 |
| Reward Modeling Evaluation | RewardBench Safety 2 | Pairwise Accuracy | 72.3 | 12 |
| Reward Modeling Evaluation | RewardBench Ties 2 | Pairwise Accuracy | 91.8 | 12 |
| Translation Preference Prediction | WMT en-de | Pairwise Accuracy | 51.6 | 12 |
