Effort-Optimized, Accuracy-Driven Labelling and Validation of Test Inputs for DL Systems: A Mixed-Integer Linear Programming Approach

About

Software systems increasingly include AI components based on deep learning (DL). Reliable testing of such systems requires near-perfect test-input validity and label accuracy, with minimal human effort. Yet the DL community has largely overlooked the need to build highly accurate datasets with minimal effort, since DL training is generally tolerant of labelling errors. Instead, this challenge reflects concerns more familiar to software engineering, where a central goal is to construct high-accuracy test inputs, with accuracy as close to 100% as possible, while keeping associated costs in check. In this article, we introduce OPAL, a human-assisted labelling method that can be configured to target a desired accuracy level while minimizing the manual effort required for labelling. The main contribution of OPAL is a mixed-integer linear programming (MILP) formulation that minimizes labelling effort subject to a specified accuracy target. To evaluate OPAL, we instantiate it for two tasks in the context of testing vision systems: automatic labelling of test inputs and automated validation of test inputs. Our evaluation, based on more than 2500 experiments performed on nine datasets and comparing OPAL with eight baseline methods, shows that OPAL, relying on its MILP formulation, achieves an average accuracy of 98.8% while cutting manual labelling by more than half. OPAL significantly outperforms automated labelling baselines in labelling accuracy across all nine datasets when all methods are provided with the same manual-labelling budget. For automated test-input validation, OPAL reduces manual effort by 28.8% on average while achieving 4.5% higher accuracy than the state-of-the-art (SOTA) test-input validation baselines. Finally, we show that augmenting OPAL with an active-learning loop leads to an additional 4.5% reduction in required manual labelling, without compromising accuracy.
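To make the idea of "minimize labelling effort subject to an accuracy target" concrete, here is a minimal sketch of a toy 0/1 program. It is not OPAL's actual MILP formulation, which the abstract does not spell out. The sketch assumes that inputs are grouped into hypothetical confidence bins, that each bin has an estimated automatic-labelling accuracy, and that manual labels are always correct. The function name `min_manual_effort` and the bin values are illustrative only, and the tiny instance is solved by brute-force enumeration rather than by a MILP solver.

```python
from itertools import product

def min_manual_effort(bins, target):
    """Toy 0/1 program (illustrative, not OPAL's formulation).

    bins:   list of (count, estimated_auto_accuracy) pairs (hypothetical).
    target: required overall label accuracy in [0, 1].
    Returns (manual_count, choice), where choice[i] == 1 means bin i is
    labelled manually, or None if no assignment meets the target.
    """
    total = sum(n for n, _ in bins)
    best = None
    # Enumerate every assignment x in {0, 1}^k; feasible only for small k.
    for choice in product((0, 1), repeat=len(bins)):
        # Expected number of correct labels: manual bins are assumed perfect,
        # automatic bins contribute count * estimated accuracy.
        correct = sum(n * (1.0 if x else a) for (n, a), x in zip(bins, choice))
        if correct >= target * total - 1e-9:  # accuracy constraint
            manual = sum(n for (n, _), x in zip(bins, choice) if x)
            if best is None or manual < best[0]:  # objective: least manual effort
                best = (manual, choice)
    return best

# Hypothetical bins: (number of inputs, estimated auto-label accuracy).
example_bins = [(500, 0.999), (300, 0.97), (200, 0.80)]
result = min_manual_effort(example_bins, 0.98)
```

In this example, only the lowest-confidence bin needs manual labelling to reach the 98% target. Because both the objective and the constraint are linear in the binary choices, the same toy problem could be handed to an off-the-shelf MILP solver when the number of bins makes enumeration impractical.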

Mohammad Hossein Amini, Mehrdad Sabetzadeh, Shiva Nejati • 2025

Related benchmarks

Task	Dataset	Result
Image Classification	Fashion MNIST	Accuracy 98.8	240
Labelling Accuracy	CIFAR-10	Accuracy 98.4	5
Labelling Accuracy	MNIST	Accuracy 99.5	5
Labelling Accuracy	SVHN	Accuracy 98.7	5
Labelling Accuracy	CelebA Hair	Accuracy 99	5
Labelling Accuracy	CelebA M/F	Accuracy (CelebA M/F) 99.1	5
Labelling Accuracy	Synthetic Pub1	Accuracy 97.9	5
Labelling Accuracy	Synthetic Pub 2	Accuracy 99.2	5
Labelling Accuracy	Industry	Accuracy 98.6	5

