
Distribution-Calibrated Inference-Time Compute for Thinking LLM-as-a-Judge

About

Thinking Large Language Models (LLMs) used as judges for pairwise preferences remain noisy at the single-sample level, and common aggregation rules (majority vote, soft self-consistency, or instruction-based self-aggregation) are inconsistent when ties are allowed. We study inference-time compute (ITC) for evaluators that generate n independent thinking-rating samples per item, and propose a principled, distribution-calibrated aggregation scheme. Our method models three-way preferences with a Bradley-Terry-Davidson formulation on rating counts, leveraging both polarity (margin among non-ties) and decisiveness (non-tie rate) to distinguish narrow margins from strong consensus. Across various evaluation benchmarks, our approach consistently reduces MAE and increases pairwise accuracy versus standard baselines, and when evaluated against human-consensus meta-labels, matches or exceeds individual human raters. These results show that carefully allocating ITC and aggregating with distribution-aware methods turns noisy individual model judgments into reliable ratings for evaluation.
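The aggregation idea above can be sketched for a single item pair: from n independent thinking-rating samples we get counts of "A wins", "B wins", and "tie", and a Bradley-Terry-Davidson model separates polarity (the margin among non-ties) from decisiveness (the non-tie rate). The sketch below is illustrative, not the paper's exact estimator; the Laplace smoothing `alpha` and the final `score` combination are assumptions.

```python
import math

def btd_aggregate(wins_a: int, wins_b: int, ties: int, alpha: float = 0.5):
    """Aggregate n three-way judge samples for one item pair under a
    Bradley-Terry-Davidson model (illustrative sketch, not the paper's
    exact estimator).  alpha is Laplace smoothing, an assumption here."""
    n = wins_a + wins_b + ties
    # Smoothed empirical probabilities of the three outcomes.
    pa = (wins_a + alpha) / (n + 3 * alpha)
    pb = (wins_b + alpha) / (n + 3 * alpha)
    pt = (ties + alpha) / (n + 3 * alpha)
    # For a single pair the BTD maximum likelihood is saturated: the
    # strength ratio matches the win ratio, and the Davidson tie
    # parameter nu absorbs the tie mass.
    polarity = math.log(pa / pb)       # margin among non-ties (log-odds)
    nu = pt / math.sqrt(pa * pb)       # Davidson tie parameter
    decisiveness = 1.0 - pt            # non-tie rate
    # Illustrative calibrated score: the signed margin among non-ties,
    # shrunk toward zero when many samples are ties.
    score = decisiveness * (pa - pb) / (pa + pb)
    return {"polarity": polarity, "nu": nu,
            "decisiveness": decisiveness, "score": score}
```

With counts (7, 1, 2) the score is strongly positive (clear consensus for A), while (4, 3, 3) yields a score near zero: the same majority direction, but a narrow margin with low decisiveness, which is exactly the distinction majority voting discards.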

Hamid Dadkhahi, Firas Trabelsi, Parker Riley, Juraj Juraska, Mehdi Mirzazadeh • 2025

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Reward Modeling | RewardBench Focus 2 | Accuracy | 72.5 | 82 |
| Reward Modeling | RewardBench Precise IF 2 | -- | -- | 70 |
| Reward Modeling Evaluation | RewardBench Factuality 2 | Pairwise Accuracy | 56.6 | 64 |
| Judge Alignment | WMT Zh-En | Pairwise Accuracy | 62.4 | 40 |
| Machine Translation Evaluation | WMT 2023 (test) | MAE (EN→DE) | 0.588 | 12 |
| Reward Model Evaluation | RewardBench 2 (test) | RB2 Factuality MAE | 0.451 | 12 |
| Reward Modeling Evaluation | RewardBench Math 2 | Pairwise Accuracy | 72.3 | 12 |
| Reward Modeling Evaluation | RewardBench Safety 2 | Pairwise Accuracy | 72.3 | 12 |
| Reward Modeling Evaluation | RewardBench Ties 2 | Pairwise Accuracy | 91.8 | 12 |
| Translation Preference Prediction | WMT en-de | Pairwise Accuracy | 51.6 | 12 |
