Learning Token-based Representation for Image Retrieval
About
In image retrieval, deep local features learned in a data-driven manner have been shown to improve retrieval performance. To enable efficient retrieval on large image databases, some approaches quantize deep local features with a large codebook and match images with an aggregated match kernel. However, these approaches are computationally complex and have a large memory footprint, which limits their ability to perform feature learning and aggregation jointly. To generate compact global representations while maintaining regional matching capability, we propose a unified framework that jointly learns local feature representation and aggregation. In our framework, we first extract deep local features using CNNs. Then, we design a tokenizer module that aggregates them into a few visual tokens, each corresponding to a specific visual pattern. This helps to remove background noise and to capture more discriminative regions in the image. Next, a refinement block enhances the visual tokens with self-attention and cross-attention. Finally, the visual tokens are concatenated to generate a compact global representation. The whole framework is trained end-to-end with image-level labels. Extensive experiments show that our approach outperforms state-of-the-art methods on the Revisited Oxford and Paris datasets.
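The aggregation step described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the tokenizer can be approximated by a softmax attention over spatial locations, one attention map per token, and it omits the CNN backbone and the refinement block. The names `tokenize`, `global_descriptor`, `local_feats`, and `queries` are hypothetical, and the toy values stand in for learned parameters.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def tokenize(local_feats, queries):
    """Aggregate N local features (dim C each) into L visual tokens.

    Assumption: each token is an attention-weighted sum of the local
    features, with weights from a softmax over the N locations.
    """
    dim = len(local_feats[0])
    tokens = []
    for q in queries:
        weights = softmax([dot(q, f) for f in local_feats])
        token = [sum(w * f[d] for w, f in zip(weights, local_feats))
                 for d in range(dim)]
        tokens.append(token)
    return tokens

def global_descriptor(tokens):
    """Concatenate the tokens and L2-normalize the result."""
    flat = [v for t in tokens for v in t]
    norm = math.sqrt(sum(v * v for v in flat)) or 1.0
    return [v / norm for v in flat]

# Toy input: 4 local features of dimension 3, and 2 hypothetical token queries.
local_feats = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [1.0, 1.0, 0.0]]
queries = [[5.0, 0.0, 0.0], [0.0, 0.0, 5.0]]

tokens = tokenize(local_feats, queries)
desc = global_descriptor(tokens)
print(len(tokens), len(desc))
```

Each query attends mostly to the locations that match its pattern, so background locations contribute little to the corresponding token. Two images could then be compared by the dot product of their normalized descriptors.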
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Image Retrieval | Revisited Oxford (ROxf) (Medium) | mAP | 60.8 | 124 |
| Image Retrieval | Revisited Paris (RPar) (Hard) | mAP | 54.8 | 115 |
| Image Retrieval | Oxford 5k | mAP | 81.2 | 100 |
| Image Retrieval | Revisited Oxford (ROxf) (Hard) | mAP | 37.3 | 81 |
| Image Retrieval | Paris Revisited (Medium) | mAP | 75.8 | 63 |
| Image Retrieval | Paris6k | mAP | 89.6 | 45 |
| Image Retrieval | RPar+R1M Medium | mAP | 44.1 | 31 |
| Image Retrieval | RPar+R1M Hard | mAP | 19.7 | 31 |
| Image Retrieval | ROxf + R1M | Retrieval Latency (s) | 0.1042 | 10 |
| Image Retrieval | RPar + R1M | Memory (GB) | 0.1 | 10 |