Query-efficient model evaluation using cached responses

About

Evaluating a new model on an existing benchmark is often necessary to understand its behavior before deployment. For modern evaluation frameworks, generating and evaluating a response for every query can be prohibitively expensive. In practice, responses from previously evaluated models are often cached, which creates an opportunity to use them to reduce the number of queries needed to accurately evaluate a new model. In this paper, we introduce an approach for predicting benchmark performance from cached model responses. The approach is based on the Data Kernel Perspective Space (DKPS), a method for quantifying the relationship between models in the black-box setting. Theoretically, we show that DKPS-based methods are query-efficient under certain conditions. Empirically, we demonstrate that DKPS-based methods achieve the same mean absolute error as baselines with a substantially smaller query budget. We conclude by proposing an offline method for selecting a set of queries that maximizes the goodness-of-fit on reference models, improving prediction accuracy over random query selection.

Hayden Helm, Ben Johnson, Carey Priebe • 2026
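
The abstract does not spell out the prediction pipeline, so the following is only a minimal sketch of one way a DKPS-style predictor could look, not the authors' implementation. It assumes that each model's responses have already been embedded as fixed-length vectors and that the perspective space is obtained by classical multidimensional scaling of pairwise distances between models. It then fits an ordinary least-squares regression from the reference models' coordinates to their known benchmark scores. The embeddings are synthetic, and every name here (`pairwise_model_distances`, `classical_mds`, `predict_score`, `make_responses`) is hypothetical.

```python
import numpy as np


def pairwise_model_distances(responses):
    """Distances between models from their embedded responses.

    responses has shape (n_models, n_queries, embed_dim). Each model is
    treated as one long vector; distances are scaled by sqrt(n_queries).
    """
    flat = responses.reshape(responses.shape[0], -1)
    diff = flat[:, None, :] - flat[None, :, :]
    return np.linalg.norm(diff, axis=-1) / np.sqrt(responses.shape[1])


def classical_mds(dist, n_components=2):
    """Classical MDS: low-dimensional coordinates from a distance matrix."""
    n = dist.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ (dist ** 2) @ centering
    eigvals, eigvecs = np.linalg.eigh(gram)
    top = np.argsort(eigvals)[::-1][:n_components]
    # Clip tiny negative eigenvalues caused by floating-point error.
    return eigvecs[:, top] * np.sqrt(np.clip(eigvals[top], 0.0, None))


def predict_score(ref_responses, ref_scores, new_responses, n_components=2):
    """Embed reference models plus the new model, then regress scores."""
    stacked = np.concatenate([ref_responses, new_responses[None]], axis=0)
    coords = classical_mds(pairwise_model_distances(stacked), n_components)
    design = np.column_stack([np.ones(len(coords)), coords])
    # Fit only on reference models (all rows except the last).
    weights, *_ = np.linalg.lstsq(design[:-1], ref_scores, rcond=None)
    return float(design[-1] @ weights)


# Synthetic demo (assumption): one latent "skill" drives both the embedded
# responses and the benchmark score of each model.
rng = np.random.default_rng(0)
n_queries, embed_dim, budget = 40, 8, 10
base = rng.normal(size=(n_queries, embed_dim))
direction = rng.normal(size=(n_queries, embed_dim))


def make_responses(skill):
    noise = 0.01 * rng.normal(size=(n_queries, embed_dim))
    return base + skill * direction + noise


ref_skills = rng.uniform(0.0, 1.0, size=20)
ref_responses = np.stack([make_responses(s) for s in ref_skills])
ref_scores = 0.2 + 0.6 * ref_skills  # cached benchmark scores

new_skill = 0.5
true_score = 0.2 + 0.6 * new_skill
new_responses = make_responses(new_skill)

# Query the new model on only `budget` queries; reuse cached reference
# responses for the same queries.
query_idx = rng.choice(n_queries, size=budget, replace=False)
predicted = predict_score(
    ref_responses[:, query_idx], ref_scores, new_responses[query_idx]
)
print(f"predicted={predicted:.3f} true={true_score:.3f}")
```

In this sketch the new model answers only `budget` of the `n_queries` queries, and the reference models contribute their cached responses to the same subset. The subset is drawn at random here; the offline query selection mentioned in the abstract would replace that `rng.choice` call.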

Related benchmarks

Task	Dataset	Result
Legal Reasoning	LegalBench	MAE 0.033	16
Machine Translation	WMT'14	MAE 0.013	16
Medical Question Answering	MedQA	MAE 0.038	16
Mathematics	MATH	MAE 0.042	16

