A Large-scale Study of Representation Learning with the Visual Task Adaptation Benchmark
About
Representation learning promises to unlock deep learning for the long tail of vision tasks without expensive labelled datasets. Yet, the absence of a unified evaluation for general visual representations hinders progress. Popular protocols are often too constrained (linear classification), limited in diversity (ImageNet, CIFAR, Pascal-VOC), or only weakly related to representation quality (ELBO, reconstruction error). We present the Visual Task Adaptation Benchmark (VTAB), which defines good representations as those that adapt to diverse, unseen tasks with few examples. With VTAB, we conduct a large-scale study of many popular publicly-available representation learning algorithms. We carefully control confounders such as architecture and tuning budget. We address questions like: How effective are ImageNet representations beyond standard natural datasets? How do representations trained via generative and discriminative models compare? To what extent can self-supervision replace labels? And, how close are we to general visual representations?
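The paper releases its own evaluation suite, but as a rough illustration of the VTAB-1k protocol described above, the sketch below fine-tunes a pretrained representation on 1,000 labelled examples of a downstream task and reports test accuracy; VTAB averages this score over its suite of diverse tasks. The choice of backbone (ImageNet ResNet-50), task (Oxford Flowers-102, one of VTAB's natural tasks), and all hyperparameters are illustrative assumptions, not the paper's exact setup.

```python
# Minimal sketch of the VTAB-1k adaptation protocol (illustrative, not
# the official evaluation code): adapt a pretrained representation to an
# unseen task using only 1000 labelled examples, then measure accuracy.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, Subset
import torchvision
from torchvision import transforms

device = "cuda" if torch.cuda.is_available() else "cpu"

# Pretrained representation under evaluation (ImageNet ResNet-50 here).
model = torchvision.models.resnet50(weights="IMAGENET1K_V2")
num_classes = 102  # Oxford Flowers-102, used as an example downstream task
model.fc = nn.Linear(model.fc.in_features, num_classes)  # fresh task head
model = model.to(device)

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

train_set = torchvision.datasets.Flowers102(
    root="data", split="train", transform=preprocess, download=True)
test_set = torchvision.datasets.Flowers102(
    root="data", split="test", transform=preprocess, download=True)

# VTAB-1k: only 1000 labelled examples are available for adaptation.
train_1k = Subset(train_set, range(min(1000, len(train_set))))
train_loader = DataLoader(train_1k, batch_size=64, shuffle=True)
test_loader = DataLoader(test_set, batch_size=256)

optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
loss_fn = nn.CrossEntropyLoss()

# Short fine-tuning budget; the paper carefully controls tuning budgets.
model.train()
for epoch in range(10):
    for images, labels in train_loader:
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()
        loss_fn(model(images), labels).backward()
        optimizer.step()

# Per-task test accuracy; VTAB reports the mean over all of its tasks.
model.eval()
correct = total = 0
with torch.no_grad():
    for images, labels in test_loader:
        preds = model(images.to(device)).argmax(dim=1).cpu()
        correct += (preds == labels).sum().item()
        total += labels.numel()
print(f"test accuracy: {correct / total:.4f}")
```

Fine-tuning the whole network (rather than a linear probe) matches the paper's framing: the benchmark scores how well a representation adapts, not just how linearly separable its features are.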
Benchmark results
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Object Detection | COCO 2017 (val) | AP | 35.3 | 2454 |
| Image Classification | VTAB 1k (test) | Accuracy (Natural) | 59.29 | 121 |
| Image Classification | VTAB-1K 1.0 (test) | Natural Accuracy | 65.2 | 102 |
| Image Classification | VTAB v2 (test) | Mean Accuracy | 67.5 | 39 |
| Visual Task Adaptation | VTAB-1k v1 (test) | Mean Accuracy | 68 | 29 |
| Fine-grained Visual Categorization | FGVC (CUB-200, NABirds, Oxford Flowers, Stanford Dogs, Stanford Cars) 1.0 (test) | Mean Accuracy | 88.41 | 12 |