Confident Rankings with Fewer Items: Adaptive LLM Evaluation with Continuous Scores
About
Computerized Adaptive Testing (CAT) has proven effective for efficient LLM evaluation on multiple-choice benchmarks, but modern LLM evaluation increasingly relies on generation tasks whose outputs are scored continuously rather than marked correct/incorrect. We present a principled extension of IRT-based adaptive testing to continuous bounded scores (ROUGE, BLEU, LLM-as-a-Judge) by replacing the Bernoulli response distribution with a heteroskedastic normal distribution. Building on this, we introduce an uncertainty-aware ranker with adaptive stopping criteria that achieves reliable model ranking while testing as few items as possible, keeping evaluation cost low. We validate our method on five benchmarks spanning n-gram-based, embedding-based, and LLM-as-judge metrics. Our method uses only 2% of the items while improving ranking correlation by 0.12 Kendall's τ over random sampling, with 95% accuracy on confident predictions.
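The core modeling idea above can be sketched in a few lines: keep an IRT-style logistic link for the expected score, but score the observed continuous value under a normal likelihood whose variance depends on the expected score. This is a minimal illustration, not the paper's exact parameterization; the 2PL-style link, the mean-dependent variance form, and all function names here are assumptions for exposition.

```python
import numpy as np

def irt_continuous_loglik(theta, b, a, score, sigma0=0.1):
    """Log-likelihood of a continuous bounded score in [0, 1] under a
    heteroskedastic normal response model (illustrative sketch).

    theta: model ability; b: item difficulty; a: item discrimination.
    """
    # Expected score via a 2PL-style logistic link (assumed form)
    mu = 1.0 / (1.0 + np.exp(-a * (theta - b)))
    # Heteroskedastic variance: largest near mu = 0.5, shrinking toward
    # the score bounds 0 and 1 (one plausible choice, not the paper's)
    sigma2 = sigma0 * mu * (1.0 - mu) + 1e-6
    return -0.5 * np.log(2.0 * np.pi * sigma2) - (score - mu) ** 2 / (2.0 * sigma2)

def estimate_theta(items, scores, grid=np.linspace(-3.0, 3.0, 601)):
    """Point estimate of ability: maximize the summed log-likelihood on a grid.

    items: list of (b, a) pairs; scores: observed continuous scores.
    """
    totals = [
        sum(irt_continuous_loglik(t, b, a, s) for (b, a), s in zip(items, scores))
        for t in grid
    ]
    return grid[int(np.argmax(totals))]
```

In an adaptive loop, one would pick the next item to maximize information about θ (or about the pairwise ranking) and stop once the posterior ranking uncertainty falls below a threshold; the grid search here stands in for whatever estimator the full method uses.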
Related benchmarks
| Task | Dataset | Metric | Score | Rank |
|---|---|---|---|---|
| Model Ranking | BioLaySumm ROUGE-L (test) | Kendall's τ | 0.957 | 2 |
| Model Ranking | BioLaySumm BERTScore (test) | Kendall's τ | 0.903 | 2 |
| Model Ranking | BioLaySumm FKGL (test) | Kendall's τ | 0.800 | 2 |
| Model Ranking | TruthfulQA LLM-Judge (test) | Kendall's τ | 0.490 | 2 |
| Model Ranking | TruthfulQA BERTScore (test) | Kendall's τ | 0.450 | 2 |
| Model Ranking | FLORES BLEU (test) | Kendall's τ | 0.803 | 2 |
| Model Ranking | FLORES COMET (test) | Kendall's τ | 0.677 | 2 |
| Model Ranking | GovReport ROUGE-L (test) | Kendall's τ | 0.800 | 2 |
| Model Ranking | Nemotron F1 (test) | Kendall's τ | 0.673 | 2 |