Confident Rankings with Fewer Items: Adaptive LLM Evaluation with Continuous Scores
About
Computerized Adaptive Testing (CAT) has proven effective for efficient LLM evaluation on multiple-choice benchmarks, but modern LLM evaluation increasingly relies on generation tasks whose outputs are scored continuously rather than marked correct/incorrect. We present a principled extension of IRT-based adaptive testing to continuous bounded scores (ROUGE, BLEU, LLM-as-a-Judge) by replacing the Bernoulli response distribution with a heteroskedastic normal distribution. Building on this, we introduce an uncertainty-aware ranker with adaptive stopping criteria that achieves reliable model ranking while testing as few items as possible, keeping evaluation cost low. We validate our method on five benchmarks spanning n-gram-based, embedding-based, and LLM-as-judge metrics. Our method uses only 2% of the items while improving ranking correlation by 0.12 Kendall's τ over random sampling, with 95% accuracy on confident predictions.
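The core modeling idea above can be sketched in a few lines: keep an IRT-style logistic link for the expected score, but score the observed continuous value under a normal likelihood whose variance depends on the expected score. This is a minimal illustration, not the paper's exact parameterization; the 2PL-style link, the mean-dependent variance form, and all function names here are assumptions for exposition.

```python
import numpy as np

def irt_continuous_loglik(theta, b, a, score, sigma0=0.1):
    """Log-likelihood of a continuous bounded score in [0, 1] under a
    heteroskedastic normal response model (illustrative sketch).

    theta: model ability; b: item difficulty; a: item discrimination.
    """
    # Expected score via a 2PL-style logistic link (assumed form)
    mu = 1.0 / (1.0 + np.exp(-a * (theta - b)))
    # Heteroskedastic variance: largest near mu = 0.5, shrinking toward
    # the score bounds 0 and 1 (one plausible choice, not the paper's)
    sigma2 = sigma0 * mu * (1.0 - mu) + 1e-6
    return -0.5 * np.log(2.0 * np.pi * sigma2) - (score - mu) ** 2 / (2.0 * sigma2)

def estimate_theta(items, scores, grid=np.linspace(-3.0, 3.0, 601)):
    """Point estimate of ability: maximize the summed log-likelihood on a grid.

    items: list of (b, a) pairs; scores: observed continuous scores.
    """
    totals = [
        sum(irt_continuous_loglik(t, b, a, s) for (b, a), s in zip(items, scores))
        for t in grid
    ]
    return grid[int(np.argmax(totals))]
```

In an adaptive loop, one would pick the next item to maximize information about θ (or about the pairwise ranking) and stop once the posterior ranking uncertainty falls below a threshold; the grid search here stands in for whatever estimator the full method uses.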
Related benchmarks
| Task | Dataset | Metric | Score | Rank |
|---|---|---|---|---|
| Model Ranking | BioLaySumm ROUGE-L (test) | Kendall's τ | 0.957 | 2 |
| Model Ranking | BioLaySumm BERTScore (test) | Kendall's τ | 0.903 | 2 |
| Model Ranking | BioLaySumm FKGL (test) | Kendall's τ | 0.800 | 2 |
| Model Ranking | TruthfulQA LLM-Judge (test) | Kendall's τ | 0.490 | 2 |
| Model Ranking | TruthfulQA BERTScore (test) | Kendall's τ | 0.450 | 2 |
| Model Ranking | FLORES BLEU (test) | Kendall's τ | 0.803 | 2 |
| Model Ranking | FLORES COMET (test) | Kendall's τ | 0.677 | 2 |
| Model Ranking | GovReport ROUGE-L (test) | Kendall's τ | 0.800 | 2 |
| Model Ranking | Nemotron F1 (test) | Kendall's τ | 0.673 | 2 |