Dataset Pruning: Reducing Training Data by Examining Generalization Influence

About

The great success of deep learning heavily relies on increasingly larger training data, which comes at a price of huge computational and infrastructural costs. This poses crucial questions that, do all training data contribute to model's performance? How much does each individual training sample or a sub-training-set affect the model's generalization, and how to construct the smallest subset from the entire training data as a proxy training set without significantly sacrificing the model's performance? To answer these, we propose dataset pruning, an optimization-based sample selection method that can (1) examine the influence of removing a particular set of training samples on model's generalization ability with theoretical guarantee, and (2) construct the smallest subset of training data that yields strictly constrained generalization gap. The empirically observed generalization gap of dataset pruning is substantially consistent with our theoretical expectations. Furthermore, the proposed method prunes 40% training examples on the CIFAR-10 dataset, halves the convergence time with only 1.3% test accuracy decrease, which is superior to previous score-based sample selection methods.

Shuo Yang, Zeke Xie, Hanyu Peng, Min Xu, Mingming Sun, Ping Li• 2022

Related benchmarks

Task	Dataset	Result
Graph Classification	MUTAG	Accuracy87.9	1103
Image Classification	CIFAR-100	Accuracy77.2	691
Image Classification	CIFAR100	Accuracy77.2	301
Image Classification	CIFAR-10	Accuracy94.9	246
Graph Classification	ogbg-molpcba (test)	AP27.7	212
Image Classification	CIFAR-100	Accuracy77.2	204
Image Classification	CIFAR-10 (test)	Test Accuracy94.9	154
Graph Classification	DHFR	Accuracy75.6	145
Image Classification	CIFAR10	Accuracy94.9	143
Image Classification	CIFAR-100 (test)	Acc77.2	110

Showing 10 of 21 rows

Other info

Follow for update

@wizwand_team Discord