# Training Vision Transformers for Image Retrieval

## About
Transformers have shown outstanding results for natural language understanding and, more recently, for image classification. Here we extend this work and propose a transformer-based approach for image retrieval: we adopt vision transformers for generating image descriptors and train the resulting model with a metric learning objective, which combines a contrastive loss with a differential entropy regularizer. Our results show consistent and significant improvements of transformers over convolution-based approaches. In particular, our method outperforms the state of the art on several public benchmarks for category-level retrieval, namely Stanford Online Products, In-Shop and CUB-200. Furthermore, our experiments on ROxford and RParis also show that, in comparable settings, transformers are competitive for particular object retrieval, especially in the regime of short vector representations and low-resolution images.
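The objective above can be sketched in a few lines. The following is an illustrative NumPy sketch, not the authors' implementation: the margin-based form of the contrastive loss, the `margin` value, and the nearest-neighbor (KoLeo-style) form of the differential entropy regularizer are assumptions made for clarity.

```python
import numpy as np

def contrastive_loss(desc, labels, margin=0.5):
    """Margin-based contrastive loss over all pairs of L2-normalized
    descriptors (margin value is an illustrative assumption)."""
    desc = desc / np.linalg.norm(desc, axis=1, keepdims=True)
    sim = desc @ desc.T                      # pairwise cosine similarity
    n, loss, count = len(labels), 0.0, 0
    for i in range(n):
        for j in range(i + 1, n):
            if labels[i] == labels[j]:
                loss += 1.0 - sim[i, j]                 # pull positives together
            else:
                loss += max(0.0, sim[i, j] - margin)    # push negatives below margin
            count += 1
    return loss / count

def entropy_regularizer(desc, eps=1e-8):
    """Differential entropy regularizer (KoLeo-style sketch): maximizing the
    log nearest-neighbor distance spreads descriptors over the unit sphere."""
    desc = desc / np.linalg.norm(desc, axis=1, keepdims=True)
    dists = np.linalg.norm(desc[:, None, :] - desc[None, :, :], axis=-1)
    np.fill_diagonal(dists, np.inf)          # ignore self-distances
    nn = dists.min(axis=1)                   # distance to nearest neighbor
    return -np.log(nn + eps).mean()          # lower when descriptors are spread out

# total objective: loss = contrastive_loss(...) + lam * entropy_regularizer(...)
```

In practice the two terms are combined with a weighting factor and minimized over batches of transformer descriptors; the regularizer counteracts the tendency of the contrastive loss to collapse descriptors into tight clusters.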
## Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Image Retrieval | CUB-200-2011 (test) | Recall@1 | 76.6 | 251 |
| Image Retrieval | Stanford Online Products (test) | Recall@1 | 84.2 | 220 |
| Image Retrieval | In-shop Clothes Retrieval Dataset | Recall@1 | 91.9 | 120 |
| Image Retrieval | CUB | Recall@1 | 76.6 | 87 |
| Image Retrieval | SOP (test) | Recall@1 | 84.2 | 42 |
| Image Retrieval | R-Oxford Medium | mAP | 34.5 | 35 |
| In-shop clothing retrieval | DeepFashion in-shop | Top-1 Accuracy | 91.9 | 26 |
| Landmarks retrieval | ROxford Hard 30 | mAP | 15.8 | 2 |
| Landmarks retrieval | RParis Medium 30 | mAP | 65.8 | 2 |
| Landmarks retrieval | RParis Hard 30 | mAP | 42.0 | 2 |