Zero-Shot Coreset Selection via Iterative Subspace Sampling

About

Deep learning increasingly relies on massive datasets that carry substantial storage, annotation, and training costs. To reduce these costs, coreset selection finds a representative subset of data for training models, ideally matching the performance of training on the full dataset. To maximize performance, current state-of-the-art coreset methods select data using dataset-specific ground truth labels and training. However, these requirements prevent selection at scale on real-world, unlabeled data. To that end, this paper addresses the selection of coresets that achieve state-of-the-art performance without using any labels or training on candidate data. Instead, our solution, Zero-Shot Coreset Selection via Iterative Subspace Sampling (ZCore), uses previously trained foundation models to generate zero-shot, high-dimensional embedding spaces to interpret unlabeled data. ZCore then iteratively quantifies the relative value of all candidate data based on coverage and redundancy in numerous subspace distributions. Finally, ZCore selects a coreset sized for any data budget to train downstream models. We evaluate ZCore on four datasets, where it outperforms several state-of-the-art label-based methods, especially at low data rates that provide the most substantial cost reduction. On ImageNet, ZCore selections for 10% training data achieve a downstream validation accuracy of 53.99%, which outperforms prior label-based methods and removes annotation and training costs for 1.15 million images. Our paper's code is publicly available at https://github.com/voxel51/zcore.

Brent A. Griffin, Jacob Marks, Jason J. Corso • 2024
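
The abstract describes scoring candidates by coverage and redundancy across many sampled subspaces, then keeping the top-scoring items for a given data budget. The following is a minimal sketch of that general idea, not the authors' implementation (see the linked repository for that). It assumes the embeddings have already been computed by some foundation model and are passed in as a NumPy array. The function names `subspace_coreset_scores` and `select_coreset`, the uniform probe sampling, the fixed neighbor count, and the penalty value are all hypothetical choices made for illustration.

```python
import numpy as np


def subspace_coreset_scores(embeddings, n_iters=1000, subspace_dim=2,
                            n_neighbors=5, redundancy_penalty=0.5, seed=0):
    """Score each row of `embeddings` (shape n x d) by coverage minus redundancy.

    Illustrative assumption, not the published method: each iteration picks a
    random subset of embedding dimensions, drops a uniform random probe point
    into that subspace, rewards the nearest candidate (coverage), and
    penalizes that candidate's closest neighbors (redundancy).
    """
    rng = np.random.default_rng(seed)
    n, d = embeddings.shape
    scores = np.zeros(n)
    k = min(subspace_dim, d)
    m = min(n_neighbors, n - 1)

    for _ in range(n_iters):
        # Sample a random axis-aligned subspace of the embedding space.
        dims = rng.choice(d, size=k, replace=False)
        sub = embeddings[:, dims]

        # Sample a probe point uniformly inside the data's bounding box.
        probe = rng.uniform(sub.min(axis=0), sub.max(axis=0))

        # Coverage: the candidate closest to the probe gains score.
        winner = int(np.argmin(np.linalg.norm(sub - probe, axis=1)))
        scores[winner] += 1.0

        # Redundancy: the winner's nearest neighbors in this subspace lose score.
        if m > 0:
            dist = np.linalg.norm(sub - sub[winner], axis=1)
            dist[winner] = np.inf  # never penalize the winner itself
            neighbors = np.argpartition(dist, m - 1)[:m]
            scores[neighbors] -= redundancy_penalty

    return scores


def select_coreset(embeddings, budget, **kwargs):
    """Return indices of the top-scoring fraction `budget` of the candidates."""
    scores = subspace_coreset_scores(embeddings, **kwargs)
    n_keep = int(round(budget * len(embeddings)))
    return np.argsort(-scores, kind="stable")[:n_keep]


if __name__ == "__main__":
    # Stand-in for foundation-model embeddings of 500 unlabeled images.
    fake_embeddings = np.random.default_rng(42).normal(size=(500, 32))
    keep = select_coreset(fake_embeddings, budget=0.1)
    print(f"selected {len(keep)} of {len(fake_embeddings)} candidates")
```

No labels or model training appear anywhere in the sketch, which matches the zero-shot setting in the abstract. The scores only rank candidates, so the same scores can be reused for any data budget.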

Related benchmarks

Task	Dataset	Metric	Result
Image Classification	SUN397 (test)	Top-1 Accuracy	59.4	251
Image Classification	Food-101 (test)	Accuracy	76.7	145
Image Classification	CIFAR-100-LT balanced imbalance factor 0.1 (test)	Accuracy	55.5	45
Image Classification	CIFAR-100 LT IF=0.01 (test)	Accuracy	32.5	45
Image Classification	Caltech-101 naturally imbalanced (test)	Accuracy	71.5	45
Image Classification	Tiny-ImageNet-C 30% corrupted (test)	Accuracy	37.8	45
Image Classification	CIFAR-100-C 30% corrupted (test)	Accuracy	68.4	45
Classification	OpenML 67 datasets CC18 (10-fold cross-validation average)	F1 Score	68	42
Image Classification	CIFAR-100 (test)	Accuracy (k=80)	61.9	17
Image Classification	ImageNet 1k (test)	Accuracy (30% Threshold)	72.3	9

Showing 10 of 17 rows
