ELFS: Label-Free Coreset Selection with Proxy Training Dynamics

About

High-quality human-annotated data is crucial for modern deep learning pipelines, yet the human annotation process is both costly and time-consuming. Given a constrained human labeling budget, selecting an informative and representative data subset for labeling can significantly reduce human annotation effort. Well-performing state-of-the-art (SOTA) coreset selection methods require ground truth labels over the whole dataset, so they fail to reduce the human labeling burden. Meanwhile, SOTA label-free coreset selection methods deliver inferior performance due to poor geometry-based difficulty scores. In this paper, we introduce ELFS (Effective Label-Free Coreset Selection), a novel label-free coreset selection method. ELFS significantly improves label-free coreset selection by addressing two challenges: 1) ELFS utilizes deep clustering to estimate training dynamics-based data difficulty scores without ground truth labels; 2) pseudo-labels introduce a distribution shift in the data difficulty scores, and we propose a simple but effective double-end pruning method to mitigate the bias in the calculated scores. We evaluate ELFS on four vision benchmarks and show that, given the same vision encoder, ELFS consistently outperforms SOTA label-free baselines. For instance, when using SwAV as the encoder, ELFS outperforms D2 by up to 10.2% in accuracy on ImageNet-1K. We make our code publicly available on GitHub.
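
The double-end pruning idea mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `double_end_prune`, the parameter `hard_frac`, and the exact selection rule (drop a fraction of the highest-score samples, then keep the hardest remaining samples up to the budget, which discards the easy end) are assumptions made for illustration. The scores are assumed to be difficulty values where higher means harder, for example computed from training dynamics on pseudo-labels.

```python
import numpy as np

def double_end_prune(scores, budget, hard_frac):
    """Hypothetical sketch: select `budget` indices after pruning both ends.

    scores    : 1-D array of difficulty scores (assumed: higher = harder).
    budget    : number of samples to keep for labeling.
    hard_frac : fraction of the hardest samples to discard first (assumed knob).
    """
    scores = np.asarray(scores, dtype=float)
    n = scores.shape[0]
    # Sort indices from hardest to easiest.
    order = np.argsort(-scores, kind="stable")
    # Prune the hard end: drop the top `hard_frac` fraction.
    n_hard = int(np.floor(hard_frac * n))
    remaining = order[n_hard:]
    if budget > remaining.shape[0]:
        raise ValueError("budget exceeds the number of samples left after pruning")
    # Prune the easy end: keep only the hardest `budget` of what remains.
    return np.sort(remaining[:budget])

# Toy difficulty scores for ten samples (made-up values for illustration).
scores = np.array([0.9, 0.1, 0.5, 0.7, 0.3, 0.8, 0.2, 0.6, 0.4, 0.0])
selected = double_end_prune(scores, budget=4, hard_frac=0.2)
print(selected)
```

In this toy run the two highest-score samples are discarded first, and the four hardest of the remaining eight are kept. In practice the fraction pruned from the hard end would need to be tuned, since the abstract notes that pseudo-labels shift the score distribution.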

Haizhong Zheng, Elisa Tsai, Yifu Lu, Jiachen Sun, Brian R. Bartoldson, Bhavya Kailkhura, Atul Prakash • 2024

Related benchmarks

Task	Dataset	Result
Image Classification	SUN397 (test)	Top-1 Accuracy: 60.6	251
Image Classification	Food-101 (test)	Accuracy: 77.2	145
Image Classification	CIFAR-100-C 30% corrupted (test)	Accuracy: 73.1	45
Image Classification	CIFAR-100-LT balanced imbalance factor 0.1 (test)	Accuracy: 56	45
Image Classification	CIFAR-100 LT IF=0.01 (test)	Accuracy: 35	45
Image Classification	Tiny-ImageNet-C 30% corrupted (test)	Accuracy: 40.9	45
Image Classification	Caltech-101 naturally imbalanced (test)	Accuracy: 75.7	45
Image Classification	CIFAR-100 (test)	Accuracy (k=80): 61.9	17
Image Classification	ImageNet 1k (test)	Accuracy (30% Threshold): 73.5	9
Image Classification	CIFAR-10 (test)	Accuracy (30%): 95.3	9

