Let Go of Your Labels with Unsupervised Transfer
About
Foundation vision-language models have enabled remarkable zero-shot transferability of the pre-trained representations to a wide range of downstream tasks. However, to solve a new task, zero-shot transfer still necessitates human guidance to define visual categories that appear in the data. Here, we show that fully unsupervised transfer emerges when searching for the labeling of a dataset that induces maximal margin classifiers in representation spaces of different foundation models. We present TURTLE, a fully unsupervised method that effectively employs this guiding principle to uncover the underlying labeling of a downstream dataset without any supervision and task-specific representation learning. We evaluate TURTLE on a diverse benchmark suite of 26 datasets and show that it achieves new state-of-the-art unsupervised performance. Furthermore, TURTLE, although being fully unsupervised, outperforms zero-shot transfer baselines on a wide range of datasets. In particular, TURTLE matches the average performance of CLIP zero-shot on 26 datasets by employing the same representation space, spanning a wide range of architectures and model sizes. By guiding the search for the underlying labeling using the representation spaces of two foundation models, TURTLE surpasses zero-shot transfer and unsupervised prompt tuning baselines, demonstrating the surprising power and effectiveness of unsupervised transfer.
Related benchmarks
| Task | Dataset | Result | Rank | |
|---|---|---|---|---|
| Image Classification | Food-101 | Accuracy92.2 | 494 | |
| Image Clustering | CIFAR-10 | NMI0.929 | 243 | |
| Image Classification | ImageNet | Accuracy72.9 | 184 | |
| Clustering | MNIST (test) | -- | 122 | |
| Clustering | CIFAR-100 (test) | ACC89.1 | 110 | |
| Clustering | Fashion MNIST | NMI72.3 | 95 | |
| Clustering | CIFAR100 | Clustering Accuracy80.6 | 11 | |
| Clustering | MNIST | ACC0.573 | 5 | |
| Clustering | Food101 (test) | Accuracy92.8 | 4 | |
| Clustering | Birdsnap (test) | Accuracy67.8 | 4 |