Zero-Shot Coreset Selection via Iterative Subspace Sampling
About
Deep learning increasingly relies on massive datasets that carry substantial storage, annotation, and training costs. To reduce these costs, coreset selection finds a representative subset of data for training models that ideally perform on par with models trained on the full dataset. To maximize performance, current state-of-the-art coreset methods select data using dataset-specific ground truth labels and training. However, these requirements prevent selection at scale on real-world, unlabeled data. To that end, this paper addresses the selection of coresets that achieve state-of-the-art performance without using any labels or training on candidate data. Instead, our solution, Zero-Shot Coreset Selection via Iterative Subspace Sampling (ZCore), uses pretrained foundation models to generate zero-shot, high-dimensional embedding spaces for interpreting unlabeled data. ZCore then iteratively quantifies the relative value of all candidate data based on coverage and redundancy in numerous subspace distributions. Finally, ZCore selects a coreset sized for any data budget to train downstream models. We evaluate ZCore on four datasets, where it outperforms several state-of-the-art label-based methods, especially at low data rates, which provide the most substantial cost reduction. On ImageNet, ZCore selections for 10% training data achieve a downstream validation accuracy of 53.99%, which outperforms prior label-based methods and removes annotation and training costs for 1.15 million images. Our paper's code is publicly available at https://github.com/voxel51/zcore.
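To make the coverage-and-redundancy idea concrete, the following is a minimal sketch of one way such a scoring loop could look. It is not the authors' implementation (see the linked repository for that). The function names `score_candidates` and `select_coreset`, the uniform sampling inside per-dimension bounds, the nearest-neighbor redundancy penalty, and all parameter defaults are assumptions made for illustration. The embeddings are random stand-ins for what a foundation model would produce.

```python
import math
import random


def score_candidates(embeddings, n_iters=2000, subspace_dim=2, n_neighbors=5, seed=0):
    """Score each embedding: reward coverage, penalize redundancy.

    Illustrative assumption, not the published algorithm: each iteration
    samples a random subspace and a random point inside it, credits the
    nearest candidate, and debits that candidate's closest neighbors.
    """
    rng = random.Random(seed)
    n, dim = len(embeddings), len(embeddings[0])
    # Per-dimension bounds so sampled points fall inside the data range.
    lo = [min(e[d] for e in embeddings) for d in range(dim)]
    hi = [max(e[d] for e in embeddings) for d in range(dim)]
    scores = [0.0] * n

    for _ in range(n_iters):
        # Pick a random low-dimensional subspace (a subset of dimensions).
        dims = rng.sample(range(dim), subspace_dim)
        point = [rng.uniform(lo[d], hi[d]) for d in dims]

        def dist_to(target, i):
            # Euclidean distance restricted to the sampled dimensions.
            return math.sqrt(sum((embeddings[i][d] - t) ** 2 for d, t in zip(dims, target)))

        # Coverage: the candidate closest to the sampled point gets credit.
        best = min(range(n), key=lambda i: dist_to(point, i))
        scores[best] += 1.0

        # Redundancy: candidates nearest to the credited one lose a share.
        best_proj = [embeddings[best][d] for d in dims]
        others = sorted((i for i in range(n) if i != best),
                        key=lambda i: dist_to(best_proj, i))
        for i in others[:n_neighbors]:
            scores[i] -= 1.0 / n_neighbors

    return scores


def select_coreset(scores, budget):
    """Return sorted indices of the top-scoring fraction `budget` of candidates."""
    k = max(1, round(budget * len(scores)))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


if __name__ == "__main__":
    data_rng = random.Random(1)
    # Stand-in embeddings: 200 candidates in 8 dimensions.
    emb = [[data_rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(200)]
    coreset = select_coreset(score_candidates(emb, n_iters=300), budget=0.10)
    print(len(coreset), "of", len(emb), "candidates selected")
```

No labels or training on the candidate data enter the loop, which is the property the abstract emphasizes. The same budget arithmetic explains the reported saving: assuming the standard ImageNet-1k training set of 1,281,167 images, keeping 10% leaves roughly 0.9 × 1,281,167 ≈ 1.15 million images that need no annotation or training.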
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Image Classification | SUN397 (test) | Top-1 Accuracy | 59.4 | 231 |
| Image Classification | Food-101 (test) | Accuracy | 76.7 | 145 |
| Image Classification | CIFAR-100-LT balanced imbalance factor 0.1 (test) | Accuracy | 55.5 | 45 |
| Image Classification | CIFAR-100 LT IF=0.01 (test) | Accuracy | 32.5 | 45 |
| Image Classification | Caltech-101 naturally imbalanced (test) | Accuracy | 71.5 | 45 |
| Image Classification | Tiny-ImageNet-C 30% corrupted (test) | Accuracy | 37.8 | 45 |
| Image Classification | CIFAR-100-C 30% corrupted (test) | Accuracy | 68.4 | 45 |
| Classification | OpenML 67 datasets CC18 (10-fold cross-validation average) | F1 Score | 68 | 42 |
| Image Classification | CIFAR-100 (test) | Accuracy (k=30) | 76 | 12 |
| Image Classification | ImageNet 1k (test) | Accuracy (30% Threshold) | 72.3 | 9 |