
Active Testing: Sample-Efficient Model Evaluation

About

We introduce a new framework for sample-efficient model evaluation that we call active testing. While approaches like active learning reduce the number of labels needed for model training, existing literature largely ignores the cost of labeling test data, typically assuming, unrealistically, that large test sets are available for model evaluation. This creates a disconnect from real applications, where test labels are important and just as expensive, e.g. for optimizing hyperparameters. Active testing addresses this by carefully selecting the test points to label, ensuring model evaluation is sample-efficient. To this end, we derive theoretically grounded and intuitive acquisition strategies that are specifically tailored to the goals of active testing, noting these are distinct from those of active learning. Because actively selecting labels introduces a bias, we further show how to remove this bias while reducing the variance of the estimator at the same time. Active testing is easy to implement and can be applied to any supervised machine learning method. We demonstrate its effectiveness on models including WideResNets and Gaussian processes on datasets including Fashion-MNIST and CIFAR-100.
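The idea that non-uniform test-point selection introduces a bias, and that reweighting can remove it, can be illustrated with a deliberately simplified sketch. This is not the estimator derived in the paper: it assumes sampling with replacement and uses ordinary importance weighting, and the function name `importance_weighted_risk` and the toy loss values are hypothetical. It only shows why dividing each observed loss by its acquisition probability gives an unbiased estimate of the mean test loss, and why acquiring points in proportion to their loss reduces variance.

```python
import random


def importance_weighted_risk(losses, q, num_labels, rng):
    """Estimate the mean loss over a pool from only `num_labels` labeled points.

    losses: per-point losses (in practice only known for the points we label)
    q: acquisition probabilities over the pool (positive, summing to 1)
    rng: a random.Random instance, passed in for reproducibility
    """
    n = len(losses)
    # Draw indices with replacement according to the acquisition distribution.
    idx = rng.choices(range(n), weights=q, k=num_labels)
    # Each sampled loss is divided by n * q[i], so its expectation is the
    # uniform pool average regardless of how q favors some points.
    return sum(losses[i] / (n * q[i]) for i in idx) / num_labels


# Toy pool of four test points with made-up losses.
losses = [0.1, 0.5, 2.0, 0.4]
true_risk = sum(losses) / len(losses)

# Uniform acquisition: unbiased, but the estimate varies from run to run.
uniform_q = [1 / len(losses)] * len(losses)
uniform_est = importance_weighted_risk(losses, uniform_q, 3, random.Random(0))

# Acquisition proportional to the loss: every weighted term equals the true
# risk, so the estimate has zero variance in this idealized case.
total = sum(losses)
ideal_q = [l / total for l in losses]
ideal_est = importance_weighted_risk(losses, ideal_q, 3, random.Random(0))

print(true_risk, uniform_est, ideal_est)
```

In practice the true losses are unknown before labeling, so an acquisition distribution like `ideal_q` would have to be approximated, for example with a surrogate prediction of each point's loss. That approximation is an assumption of this sketch and is not specified in the text above.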

Jannik Kossen, Sebastian Farquhar, Yarin Gal, Tom Rainforth • 2021

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Performance Estimation | Jigsaw | MAE 0.027 | 198 |
| Performance Estimation | MMLU | MAE 0.036 | 198 |
| Performance Estimation | ToxicChat | MAE 0.017 | 198 |
| Performance Estimation | SVAMP | MAE 0.024 | 198 |
| Performance Estimation | StrategyQA | MAE 0.024 | 197 |
| Performance Estimation | GSM8K | MAE 0.03 | 197 |
| Performance Estimation | DIVE | MAE 0.033 | 189 |
| Performance Estimation | GQA | MAE 0.031 | 184 |
| Performance Estimation | DICES | MAE 0.053 | 136 |
| Population property estimation | DICES | Bias (MAE) 0.056 | 92 |
Showing 10 of 11 rows
