All the attention you need: Global-local, spatial-channel attention for image retrieval

About

We address representation learning for large-scale instance-level image retrieval. Apart from backbones, training pipelines and loss functions, popular approaches have focused on different spatial pooling and attention mechanisms, which are at the core of learning a powerful global image representation. There are different forms of attention, depending on how elements of the feature tensor interact (locally or globally) and on the dimensions where attention is applied (spatial or channel). Unfortunately, each study addresses only one or two forms of attention and applies them to different problems such as classification, detection or retrieval. We present the global-local attention module (GLAM), which is attached at the end of a backbone network and incorporates all four forms of attention: local and global, spatial and channel. We obtain a new feature tensor and, by spatial pooling, we learn a powerful embedding for image retrieval. Focusing on global descriptors, we provide empirical evidence of the interaction of all forms of attention and improve the state of the art on standard benchmarks.
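To make the four forms of attention concrete, here is a minimal, parameter-free sketch in NumPy. It is not the GLAM implementation: the real module uses learned layers, whereas every function below (`local_channel_attention`, `global_spatial_attention`, `toy_descriptor`, and so on), the plain averaging used for fusion, and the choice of generalized-mean pooling with `p=3.0` are assumptions made only for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def local_channel_attention(f):
    # f: (C, H, W). One weight per channel, derived from its spatial average.
    w = sigmoid(f.mean(axis=(1, 2)))              # (C,)
    return f * w[:, None, None]

def local_spatial_attention(f):
    # One weight per location, derived from the average over channels.
    w = sigmoid(f.mean(axis=0))                   # (H, W)
    return f * w[None, :, :]

def global_channel_attention(f):
    # Every channel attends to every other channel (C x C affinity).
    c, h, w = f.shape
    x = f.reshape(c, h * w)                       # (C, HW)
    a = softmax(x @ x.T / np.sqrt(h * w))         # (C, C), rows sum to 1
    return (a @ x).reshape(c, h, w)

def global_spatial_attention(f):
    # Every location attends to every other location (HW x HW affinity).
    c, h, w = f.shape
    x = f.reshape(c, h * w)                       # (C, HW)
    a = softmax(x.T @ x / np.sqrt(c))             # (HW, HW), rows sum to 1
    return (x @ a.T).reshape(c, h, w)

def gem_pool(f, p=3.0, eps=1e-6):
    # Generalized-mean spatial pooling: (C, H, W) -> (C,).
    x = np.clip(f, eps, None).reshape(f.shape[0], -1)
    return (x ** p).mean(axis=1) ** (1.0 / p)

def toy_descriptor(f):
    # Assumption: fuse the original, local and global streams by averaging.
    local = local_spatial_attention(local_channel_attention(f))
    glob = global_spatial_attention(global_channel_attention(f))
    fused = (f + local + glob) / 3.0
    d = gem_pool(fused)
    return d / np.linalg.norm(d)                  # L2-normalized global descriptor
```

Given a non-negative feature tensor of shape `(C, H, W)`, such as the output of a backbone's last convolutional stage, `toy_descriptor` returns a unit-length vector of size `C` that could be compared across images by dot product.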

Chull Hwan Song, Hye Joo Han, Yannis Avrithis • 2021

Related benchmarks

Task	Dataset	Result
Image Retrieval	Revisited Oxford (ROxf) (Medium)	mAP 72.2	124
Image Retrieval	Revisited Paris (RPar) (Hard)	mAP 65.6	115
Image Retrieval	Oxford 5k	mAP 90.9	100
Image Retrieval	Revisited Paris (RPar) (Medium)	mAP 77.5	100
Image Retrieval	Revisited Oxford (ROxf) (Hard)	mAP 49.6	81
Image Retrieval	Paris Revisited (Medium)	mAP 83	63
Image Retrieval	Paris6k	mAP 94.1	45
Image Retrieval	Oxford Revisited (Hard)	mAP 39.5	33
Image Retrieval	RPar+R1M Medium	mAP 58.6	31
Image Retrieval	RPar+R1M Hard	mAP 33.3	31

Showing 10 of 14 rows
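
The results above are reported as mean average precision (mAP), in percent. As a rough illustration of the metric, the sketch below computes plain, non-interpolated average precision per query and averages it over queries. The function names and the dictionary-based inputs are hypothetical, and the official benchmark protocols differ in details, such as which images count as positives or are ignored for a given difficulty setting, that this sketch omits.

```python
def average_precision(ranked_ids, positives):
    """AP for one query: mean of precision@k taken at each relevant hit."""
    positives = set(positives)
    if not positives:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, item in enumerate(ranked_ids, start=1):
        if item in positives:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    return precision_sum / len(positives)

def mean_average_precision(rankings, ground_truth):
    """mAP: average of per-query AP over all queries in `rankings`."""
    aps = [average_precision(rankings[q], ground_truth[q]) for q in rankings]
    return sum(aps) / len(aps)
```

For example, a query whose two relevant images are ranked first and third gets an AP of (1/1 + 2/3) / 2, or about 0.83; multiplying the mean over all queries by 100 gives a figure on the same scale as the table.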
