Don't Pass@k: A Bayesian Framework for Large Language Model Evaluation
About
Pass$@k$ is widely used to report the reasoning performance of LLMs, but it often produces unstable and potentially misleading rankings, especially when the number of trials (samples) is limited and computational resources are constrained. We present a principled Bayesian evaluation framework that replaces Pass$@k$ and average accuracy over $N$ trials (avg$@N$) with posterior estimates of a model's underlying success probability and credible intervals, yielding stable rankings and a transparent decision rule for differences. Evaluation outcomes are modeled as categorical (not just 0/1) with a Dirichlet prior, giving closed-form expressions for the posterior mean and uncertainty of any weighted rubric and enabling the use of prior evidence when appropriate. Theoretically, under a uniform prior, the Bayesian posterior mean is order-equivalent to average accuracy (Pass$@1$), explaining its empirical robustness while adding principled uncertainty. Empirically, in simulations with known ground-truth success rates and on AIME'24/'25, HMMT'25, and BrUMO'25, the posterior-based procedure achieves faster convergence and greater rank stability than Pass$@k$ and recent variants, enabling reliable comparisons at far smaller sample counts. The framework clarifies when observed gaps are statistically meaningful (non-overlapping credible intervals) versus noise, and it naturally extends to graded, rubric-based evaluations. Together, these results recommend replacing Pass$@k$ for LLM evaluation and ranking with a posterior-based, compute-efficient protocol that unifies binary and non-binary evaluation while making uncertainty explicit. Source code is available at https://github.com/mohsenhariri/scorio
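The posterior quantities the abstract describes have closed forms. For binary (0/1) outcomes the Dirichlet prior reduces to a Beta prior, and for a weighted rubric the posterior mean of the score follows directly from the Dirichlet posterior. A minimal sketch of both cases (function names and the example counts are illustrative, not taken from the paper's code):

```python
from scipy.stats import beta


def posterior_summary(successes, trials, a=1.0, b=1.0, cred=0.95):
    """Posterior mean and equal-tailed credible interval for a model's
    success probability under a Beta(a, b) prior (uniform by default).

    With a uniform prior the posterior mean (s + 1) / (n + 2) is a
    monotone function of s / n, so it ranks models identically to
    average accuracy (Pass@1) while also quantifying uncertainty.
    """
    a_post = a + successes
    b_post = b + trials - successes
    mean = a_post / (a_post + b_post)
    lo = beta.ppf((1.0 - cred) / 2.0, a_post, b_post)
    hi = beta.ppf(1.0 - (1.0 - cred) / 2.0, a_post, b_post)
    return mean, (lo, hi)


def rubric_posterior_mean(counts, weights, alpha=None):
    """Posterior mean of a weighted rubric score under a Dirichlet prior.

    counts[k]  -- number of trials graded into category k
    weights[k] -- rubric weight of category k (e.g. 0, 0.5, 1)
    alpha[k]   -- Dirichlet prior pseudo-counts (uniform by default)
    """
    if alpha is None:
        alpha = [1.0] * len(counts)
    total = sum(alpha) + sum(counts)
    return sum(w * (a + n) / total
               for w, a, n in zip(weights, alpha, counts))
```

Two models can then be compared by checking whether their credible intervals overlap: non-overlapping intervals indicate a statistically meaningful gap, overlapping intervals suggest the observed difference may be sampling noise.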
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Mathematical Reasoning | AIME'25 | -- | 40 |
| Mathematical Reasoning | AIME'24 | -- | 40 |
| Mathematical Reasoning | HMMT'25 | -- | 40 |
| Mathematical Reasoning | BrUMO'25 | -- | 40 |
| Gold-standard ranking agreement | Combined benchmark | Mean Kendall's tau-b: 0.8647 | 36 |
| Method ranking self-consistency | Combined benchmark (M = 120 questions) | Mean Kendall's tau-b: 0.8647 | 30 |
| Ranking Correlation Analysis | Combined AIME'24, AIME'25, HMMT'25, BrUMO'25 | Kendall's tau_b (vs. gold standard): 0.865 | 1 |