Zero-Shot Coreset Selection via Iterative Subspace Sampling

About

Deep learning increasingly relies on massive datasets that carry substantial storage, annotation, and training costs. To reduce these costs, coreset selection finds a representative subset of data for training models, ideally performing on par with training on the full dataset. To maximize performance, current state-of-the-art coreset methods select data using dataset-specific ground truth labels and training. However, these methodological requirements prevent selection at scale on real-world, unlabeled data. To that end, this paper addresses the selection of coresets that achieve state-of-the-art performance without using any labels or training on candidate data. Instead, our solution, Zero-Shot Coreset Selection via Iterative Subspace Sampling (ZCore), uses previously trained foundation models to generate zero-shot, high-dimensional embedding spaces to interpret unlabeled data. ZCore then iteratively quantifies the relative value of all candidate data based on coverage and redundancy in numerous subspace distributions. Finally, ZCore selects a coreset sized for any data budget to train downstream models. We evaluate ZCore on four datasets and outperform several state-of-the-art label-based methods, especially at low data rates that provide the most substantial cost reduction. On ImageNet, ZCore selections for 10% training data achieve a downstream validation accuracy of 53.99%, which outperforms prior label-based methods and removes annotation and training costs for 1.15 million images. Our paper's code is publicly available at https://github.com/voxel51/zcore.

Brent A. Griffin, Jacob Marks, Jason J. Corso • 2024
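
The selection procedure summarized in the abstract (score candidates by coverage and redundancy across many sampled subspaces, then keep the top-scoring data for a given budget) can be illustrated with a minimal sketch. This is not the authors' implementation; their code is at the repository linked above. It assumes embeddings have already been computed by some foundation model, and every name and parameter here (`zero_shot_scores`, `select_coreset`, `sub_dim`, `n_redundant`, `penalty`) is hypothetical, as is the specific scoring rule: reward the candidate nearest to a random point in a random subspace, and penalize its nearest neighbors in that subspace.

```python
import numpy as np


def zero_shot_scores(emb, n_iter=2000, sub_dim=2, n_redundant=5,
                     penalty=0.5, seed=0):
    """Score each row of `emb` (n, d) without labels or training.

    Hypothetical scoring rule, assumed for illustration only:
    coverage   -> +1 to the candidate closest to a random subspace point
    redundancy -> -penalty shared among that candidate's nearest neighbors
    """
    rng = np.random.default_rng(seed)
    n, d = emb.shape
    lo, hi = emb.min(axis=0), emb.max(axis=0)
    scores = np.zeros(n)
    k = min(n_redundant, n - 1)

    for _ in range(n_iter):
        # Sample a random low-dimensional subspace (a subset of axes).
        dims = rng.choice(d, size=min(sub_dim, d), replace=False)
        sub = emb[:, dims]

        # Coverage: draw a random point inside the data's bounding box
        # and credit the candidate that covers it best.
        target = rng.uniform(lo[dims], hi[dims])
        winner = int(np.argmin(np.linalg.norm(sub - target, axis=1)))
        scores[winner] += 1.0

        # Redundancy: candidates close to the winner add little new
        # information in this subspace, so they lose some score.
        if k > 0:
            dist = np.linalg.norm(sub - sub[winner], axis=1)
            dist[winner] = np.inf  # exclude the winner itself
            neighbors = np.argpartition(dist, k - 1)[:k]
            scores[neighbors] -= penalty / k

    return scores


def select_coreset(emb, budget, **kwargs):
    """Return indices of the top-scoring `budget` fraction of candidates."""
    scores = zero_shot_scores(emb, **kwargs)
    k = max(1, int(round(budget * len(emb))))
    return np.argsort(-scores)[:k]


if __name__ == "__main__":
    # Stand-in for foundation-model embeddings of unlabeled data.
    demo = np.random.default_rng(1).normal(size=(1000, 32))
    keep = select_coreset(demo, budget=0.10)
    print(keep.shape)
```

Because scoring depends only on the embeddings, the same scores can be reused to cut a coreset at any budget by changing `budget`, which mirrors the abstract's claim that the coreset can be sized for any data budget.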

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Image Classification | SUN397 (test) | Top-1 Accuracy | 59.4 | 231 |
| Image Classification | Food-101 (test) | Accuracy | 76.7 | 145 |
| Image Classification | CIFAR-100-LT balanced imbalance factor 0.1 (test) | Accuracy | 55.5 | 45 |
| Image Classification | CIFAR-100 LT IF=0.01 (test) | Accuracy | 32.5 | 45 |
| Image Classification | Caltech-101 naturally imbalanced (test) | Accuracy | 71.5 | 45 |
| Image Classification | Tiny-ImageNet-C 30% corrupted (test) | Accuracy | 37.8 | 45 |
| Image Classification | CIFAR-100-C 30% corrupted (test) | Accuracy | 68.4 | 45 |
| Classification | OpenML 67 datasets CC18 (10-fold cross-validation average) | F1 Score | 68 | 42 |
| Image Classification | CIFAR-100 (test) | Accuracy (k=30) | 76 | 12 |
| Image Classification | ImageNet 1k (test) | Accuracy (30% Threshold) | 72.3 | 9 |
Showing 10 of 17 rows
