# Training Vision Transformers for Image Retrieval

## About
Transformers have shown outstanding results for natural language understanding and, more recently, for image classification. Here we extend this work and propose a transformer-based approach for image retrieval: we adopt vision transformers for generating image descriptors and train the resulting model with a metric learning objective, which combines a contrastive loss with a differential entropy regularizer. Our results show consistent and significant improvements of transformers over convolution-based approaches. In particular, our method outperforms the state of the art on several public benchmarks for category-level retrieval, namely Stanford Online Products, In-Shop and CUB-200. Furthermore, our experiments on ROxford and RParis also show that, in comparable settings, transformers are competitive for particular object retrieval, especially in the regime of short vector representations and low-resolution images.
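The objective above can be sketched in a few lines. The following is an illustrative NumPy sketch, not the authors' implementation: the margin-based form of the contrastive loss, the `margin` value, and the nearest-neighbor (KoLeo-style) form of the differential entropy regularizer are assumptions made for clarity.

```python
import numpy as np

def contrastive_loss(desc, labels, margin=0.5):
    """Margin-based contrastive loss over all pairs of L2-normalized
    descriptors (margin value is an illustrative assumption)."""
    desc = desc / np.linalg.norm(desc, axis=1, keepdims=True)
    sim = desc @ desc.T                      # pairwise cosine similarity
    n, loss, count = len(labels), 0.0, 0
    for i in range(n):
        for j in range(i + 1, n):
            if labels[i] == labels[j]:
                loss += 1.0 - sim[i, j]                 # pull positives together
            else:
                loss += max(0.0, sim[i, j] - margin)    # push negatives below margin
            count += 1
    return loss / count

def entropy_regularizer(desc, eps=1e-8):
    """Differential entropy regularizer (KoLeo-style sketch): maximizing the
    log nearest-neighbor distance spreads descriptors over the unit sphere."""
    desc = desc / np.linalg.norm(desc, axis=1, keepdims=True)
    dists = np.linalg.norm(desc[:, None, :] - desc[None, :, :], axis=-1)
    np.fill_diagonal(dists, np.inf)          # ignore self-distances
    nn = dists.min(axis=1)                   # distance to nearest neighbor
    return -np.log(nn + eps).mean()          # lower when descriptors are spread out

# total objective: loss = contrastive_loss(...) + lam * entropy_regularizer(...)
```

In practice the two terms are combined with a weighting factor and minimized over batches of transformer descriptors; the regularizer counteracts the tendency of the contrastive loss to collapse descriptors into tight clusters.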
## Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Image Retrieval | CUB-200-2011 (test) | Recall@1 | 76.6 | 251 |
| Image Retrieval | Stanford Online Products (test) | Recall@1 | 84.2 | 220 |
| Image Retrieval | In-shop Clothes Retrieval Dataset | Recall@1 | 91.9 | 120 |
| Image Retrieval | CUB | Recall@1 | 76.6 | 87 |
| Image Retrieval | SOP (test) | Recall@1 | 84.2 | 42 |
| Image Retrieval | R-Oxford Medium | mAP | 34.5 | 35 |
| In-shop clothing retrieval | DeepFashion in-shop | Top-1 Accuracy | 91.9 | 26 |
| Landmarks retrieval | ROxford Hard 30 | mAP | 15.8 | 2 |
| Landmarks retrieval | RParis Medium 30 | mAP | 65.8 | 2 |
| Landmarks retrieval | RParis Hard 30 | mAP | 42.0 | 2 |