Learning Representations by Predicting Bags of Visual Words
About
Self-supervised representation learning aims to learn convnet-based image representations from unlabeled data. Inspired by the success of NLP methods in this area, in this work we propose a self-supervised approach based on spatially dense image descriptions that encode discrete visual concepts, here called visual words. To build such discrete representations, we quantize the feature maps of a first pre-trained self-supervised convnet over a k-means based vocabulary. Then, as a self-supervised task, we train another convnet to predict the histogram of visual words of an image (i.e., its Bag-of-Words representation) given as input a perturbed version of that image. The proposed task forces the convnet to learn perturbation-invariant and context-aware image features, useful for downstream image understanding tasks. We extensively evaluate our method and demonstrate very strong empirical results: compared to the supervised case, our pre-trained self-supervised representations transfer better on the detection task and similarly on classification over classes "unseen" during pre-training. This also shows that the process of image discretization into visual words can provide the basis for very powerful self-supervised approaches in the image domain, thus allowing further connections to be made to related methods from the NLP domain that have been extremely successful so far.
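The target construction described above (quantize a feature map against a k-means vocabulary, then summarize the image as a histogram of visual words) can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the authors' implementation: the function names, array shapes, the L1 normalization of the histogram, and the softmax cross-entropy loss are all assumptions made here for the example, and the vocabulary is taken as an already-computed array of k-means centroids.

```python
import numpy as np


def assign_visual_words(features, vocabulary):
    """Map each spatial feature vector to its nearest vocabulary entry.

    features:   (H, W, C) feature map from a pre-trained convnet (assumed shape).
    vocabulary: (K, C) k-means centroids, i.e. the visual-word vocabulary.
    Returns an (H, W) integer array of visual-word ids.
    """
    h, w, c = features.shape
    flat = features.reshape(-1, c)
    # Squared Euclidean distance via ||f||^2 - 2 f.v + ||v||^2.
    d2 = (
        (flat ** 2).sum(axis=1, keepdims=True)
        - 2.0 * flat @ vocabulary.T
        + (vocabulary ** 2).sum(axis=1)
    )
    return d2.argmin(axis=1).reshape(h, w)


def bow_histogram(word_ids, vocab_size):
    """Bag-of-Words target: L1-normalized counts of visual words (assumed normalization)."""
    counts = np.bincount(word_ids.ravel(), minlength=vocab_size).astype(np.float64)
    return counts / counts.sum()


def soft_cross_entropy(logits, target):
    """Cross-entropy between softmax(logits) and a soft target distribution.

    Assumed loss for illustration; logits would come from the second convnet
    applied to a perturbed version of the image.
    """
    shifted = logits - logits.max()  # subtract the max for numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum())
    return float(-(target * log_probs).sum())


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    features = rng.normal(size=(4, 4, 8))    # stand-in for a real feature map
    vocabulary = rng.normal(size=(5, 8))     # stand-in for k-means centroids
    word_ids = assign_visual_words(features, vocabulary)
    target = bow_histogram(word_ids, vocab_size=5)
    logits = np.zeros(5)                     # stand-in for the predictor's output
    print(word_ids)
    print(target, soft_cross_entropy(logits, target))
```

In an actual training loop the `logits` would be produced by the convnet being trained on the perturbed image, while `target` is computed from the unperturbed image with the first, frozen network.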
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Image Classification | ImageNet-1k (val) | Top-1 Acc 62.1 | 706 |
| Image Classification | Places205 (val) | Top-1 Accuracy 51.1 | 68 |
| Image Classification | VOC 2007 (test) | mAP 79.3 | 67 |
| Object Detection | VOC 2007 (test) | AP@50 81.3 | 52 |
| Image Classification | Places 205-way (test) | Top-1 Accuracy 51.1 | 38 |
| Classification | VOC07 (test) | Accuracy 79.3 | 29 |
| Object Detection | PASCAL VOC 2007 (test) | AP 55.8 | 18 |
| Object Detection | VOC 07+12 train val (test) | AP50 81.3 | 12 |
| Object Detection | VOC 07+12 (trainval) | AP50 81.3 | 9 |