
Ranking Reasoning LLMs under Test-Time Scaling

About

Test-time scaling evaluates reasoning LLMs by sampling multiple outputs per prompt, but ranking models in this regime remains underexplored. We formalize dense benchmark ranking under test-time scaling and introduce Scorio, a library that implements statistical ranking methods such as paired-comparison models, item response theory (IRT) models, voting rules, and graph- and spectral-based methods. Across $20$ reasoning models on four Olympiad-style math benchmarks (AIME'24, AIME'25, HMMT'25, and BrUMO'25; up to $N=80$ trials), most full-trial rankings agree closely with the Bayesian gold standard $\mathrm{Bayes}_{\mathcal{U}}@80$ (mean Kendall's $\tau_b = 0.93$--$0.95$), and $19$--$34$ methods recover exactly the same ordering. In the single-trial regime, the best methods reach $\tau_b \approx 0.86$. Using greedy decoding as an empirical prior ($\mathrm{Bayes}_{\mathbf{R}_0}@N$) reduces variance at $N=1$ by $16$--$52\%$, but can bias rankings when greedy and stochastic sampling disagree. These results identify reliable ranking methods for both high- and low-budget test-time scaling. We release Scorio as an open-source library at https://github.com/mohsenhariri/scorio.
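The agreement metric quoted above, Kendall's $\tau_b$, can be illustrated with a minimal sketch. This is not Scorio's API and not the paper's $\mathrm{Bayes}_{\mathcal{U}}@80$ estimator: the helper names `kendall_tau_b` and `mean_accuracy`, the success probabilities, and the synthetic 0/1 outcomes are all assumptions made for illustration, and plain mean accuracy stands in for the gold-standard score. The sketch compares a full-trial ranking with a single-trial ranking on simulated data.

```python
import math
import random
from itertools import combinations


def kendall_tau_b(x, y):
    """Kendall's tau_b between two equal-length score lists (handles ties)."""
    concordant = discordant = ties_x = ties_y = 0
    for i, j in combinations(range(len(x)), 2):
        dx = x[i] - x[j]
        dy = y[i] - y[j]
        if dx == 0 and dy == 0:
            continue  # tied in both lists: excluded from both denominator terms
        elif dx == 0:
            ties_x += 1  # tied only in x
        elif dy == 0:
            ties_y += 1  # tied only in y
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    untied = concordant + discordant
    denom = math.sqrt((untied + ties_x) * (untied + ties_y))
    return (concordant - discordant) / denom if denom else float("nan")


def mean_accuracy(model_outcomes, n):
    """Average correctness over the first n trials of every question."""
    total = sum(sum(trials[:n]) for trials in model_outcomes)
    return total / (len(model_outcomes) * n)


# Synthetic setup (assumed values, not the paper's data).
random.seed(0)
n_questions, n_trials = 30, 80
success_prob = [0.3, 0.4, 0.5, 0.6, 0.7]  # hypothetical per-model accuracy

# outcomes[m][q][t] is 1 if model m solved question q on trial t, else 0.
outcomes = [
    [[1 if random.random() < p else 0 for _ in range(n_trials)]
     for _ in range(n_questions)]
    for p in success_prob
]

full_scores = [mean_accuracy(o, n_trials) for o in outcomes]  # all 80 trials
single_scores = [mean_accuracy(o, 1) for o in outcomes]       # first trial only

print("tau_b (N=1 vs N=80):", kendall_tau_b(full_scores, single_scores))
```

Scores are passed directly rather than ranks because $\tau_b$ depends only on the pairwise ordering, so tied models are handled without a separate ranking step.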

Mohsen Hariri, Michael Hinczewski, Jing Ma, Vipin Chaudhary • 2026

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Gold-standard ranking agreement | Combined benchmark | -- | 36 |
| Method ranking self-consistency | Combined benchmark (M=120 questions) | -- | 30 |
| Ranking Correlation Analysis | AIME 24 | Kendall's tau_b (vs. Gold Standard): 0.779 | 1 |
| Ranking Correlation Analysis | AIME 25 | Kendall's tau_b (vs. Gold Standard): 0.798 | 1 |
| Ranking Correlation Analysis | BrUMO 25 | Kendall's tau_b (vs. Gold Standard): 0.858 | 1 |
| Ranking Correlation Analysis | Combined AIME'24, AIME'25, HMMT'25, BrUMO'25 | -- | 1 |
| Ranking Method Evaluation | AIME 24 | -- | 1 |
| Ranking Method Evaluation | AIME 25 | -- | 1 |
| Ranking Method Evaluation | BrUMO 25 | -- | 1 |
| Ranking Method Evaluation | Combined AIME'24, AIME'25, HMMT'25, BrUMO'25 | -- | 1 |
