Don't Pass@k: A Bayesian Framework for Large Language Model Evaluation
About
Pass$@k$ is widely used to report the reasoning performance of LLMs, but it often produces unstable and potentially misleading rankings, especially when the number of trials (samples) is limited and computational resources are constrained. We present a principled Bayesian evaluation framework that replaces Pass$@k$ and average accuracy over $N$ trials (avg$@N$) with posterior estimates of a model's underlying success probability and credible intervals, yielding stable rankings and a transparent decision rule for differences. Evaluation outcomes are modeled as categorical (not just 0/1) with a Dirichlet prior, giving closed-form expressions for the posterior mean and uncertainty of any weighted rubric and enabling the use of prior evidence when appropriate. Theoretically, under a uniform prior, the Bayesian posterior mean is order-equivalent to average accuracy (Pass$@1$), explaining its empirical robustness while adding principled uncertainty. Empirically, in simulations with known ground-truth success rates and on AIME'24/'25, HMMT'25, and BrUMO'25, the posterior-based procedure achieves faster convergence and greater rank stability than Pass$@k$ and recent variants, enabling reliable comparisons at far smaller sample counts. The framework clarifies when observed gaps are statistically meaningful (non-overlapping credible intervals) versus noise, and it naturally extends to graded, rubric-based evaluations. Together, these results recommend replacing Pass$@k$ for LLM evaluation and ranking with a posterior-based, compute-efficient protocol that unifies binary and non-binary evaluation while making uncertainty explicit. Source code is available at https://github.com/mohsenhariri/scorio
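The posterior quantities the abstract describes have closed forms. For binary (0/1) outcomes the Dirichlet prior reduces to a Beta prior, and for a weighted rubric the posterior mean of the score follows directly from the Dirichlet posterior. A minimal sketch of both cases (function names and the example counts are illustrative, not taken from the paper's code):

```python
from scipy.stats import beta


def posterior_summary(successes, trials, a=1.0, b=1.0, cred=0.95):
    """Posterior mean and equal-tailed credible interval for a model's
    success probability under a Beta(a, b) prior (uniform by default).

    With a uniform prior the posterior mean (s + 1) / (n + 2) is a
    monotone function of s / n, so it ranks models identically to
    average accuracy (Pass@1) while also quantifying uncertainty.
    """
    a_post = a + successes
    b_post = b + trials - successes
    mean = a_post / (a_post + b_post)
    lo = beta.ppf((1.0 - cred) / 2.0, a_post, b_post)
    hi = beta.ppf(1.0 - (1.0 - cred) / 2.0, a_post, b_post)
    return mean, (lo, hi)


def rubric_posterior_mean(counts, weights, alpha=None):
    """Posterior mean of a weighted rubric score under a Dirichlet prior.

    counts[k]  -- number of trials graded into category k
    weights[k] -- rubric weight of category k (e.g. 0, 0.5, 1)
    alpha[k]   -- Dirichlet prior pseudo-counts (uniform by default)
    """
    if alpha is None:
        alpha = [1.0] * len(counts)
    total = sum(alpha) + sum(counts)
    return sum(w * (a + n) / total
               for w, a, n in zip(weights, alpha, counts))
```

Two models can then be compared by checking whether their credible intervals overlap: non-overlapping intervals indicate a statistically meaningful gap, overlapping intervals suggest the observed difference may be sampling noise.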
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Mathematical Reasoning | AIME'25 | -- | 40 |
| Mathematical Reasoning | AIME'24 | -- | 40 |
| Mathematical Reasoning | HMMT'25 | -- | 40 |
| Mathematical Reasoning | BrUMO'25 | -- | 40 |
| Gold-standard ranking agreement | Combined benchmark | Mean Kendall's tau-b: 0.8647 | 36 |
| Method ranking self-consistency | Combined benchmark (M = 120 questions) | Mean Kendall's tau-b: 0.8647 | 30 |
| Ranking Correlation Analysis | Combined AIME'24, AIME'25, HMMT'25, BrUMO'25 | Kendall's tau_b (vs. gold standard): 0.865 | 1 |