Cross Modal Retrieval with Querybank Normalisation
## About
Profiting from large-scale training datasets, advances in neural architecture design and efficient inference, joint embeddings have become the dominant approach for tackling cross-modal retrieval. In this work we first show that, despite their effectiveness, state-of-the-art joint embeddings suffer significantly from the longstanding "hubness problem", in which a small number of gallery embeddings form the nearest neighbours of many queries. Drawing inspiration from the NLP literature, we formulate a simple but effective framework called Querybank Normalisation (QB-Norm) that re-normalises query similarities to account for hubs in the embedding space. QB-Norm improves retrieval performance without requiring retraining. Differently from prior work, we show that QB-Norm works effectively without concurrent access to any test set queries. Within the QB-Norm framework, we also propose a novel similarity normalisation method, the Dynamic Inverted Softmax, that is significantly more robust than existing approaches. We showcase QB-Norm across a range of cross-modal retrieval models and benchmarks, where it consistently enhances strong baselines beyond the state of the art. Code is available at https://vladbogo.github.io/QB-Norm/.
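The idea above can be sketched in a few lines of numpy. This is a rough illustration, not the repository's implementation: the function name `qb_norm_dis`, the inverse-temperature `beta`, and the exact activation rule are assumptions made for the sketch. It follows the general recipe of an inverted softmax over a querybank, applied dynamically: gallery items that are retrieved by querybank queries are treated as potential hubs, and a test query's similarities are re-normalised only when its top-ranked gallery item falls in that hub-prone set.

```python
import numpy as np

def qb_norm_dis(sims, qb_sims, beta=20.0, k=1):
    """Illustrative sketch of querybank normalisation with a
    Dynamic-Inverted-Softmax-style rule (names and defaults assumed).

    sims:    (n_queries, n_gallery) test-query-to-gallery similarities
    qb_sims: (n_bank, n_gallery)    querybank-to-gallery similarities
    """
    # Gallery items appearing in the top-k of any querybank query form
    # the "activated" set of potential hubs.
    topk = np.argsort(-qb_sims, axis=1)[:, :k]
    activated = np.zeros(sims.shape[1], dtype=bool)
    activated[np.unique(topk)] = True

    # Inverted softmax: divide each gallery column by the (exponentiated)
    # similarity mass that column receives from the querybank, so hubs
    # that attract many querybank queries are down-weighted.
    denom = np.exp(beta * qb_sims).sum(axis=0)       # (n_gallery,)
    normed = np.exp(beta * sims) / denom             # (n_queries, n_gallery)

    # Dynamic rule: re-normalise a query's similarities only if its raw
    # top-1 gallery item is in the activated (hub-prone) set.
    top1 = np.argmax(sims, axis=1)
    needs_norm = activated[top1]
    out = sims.copy()
    out[needs_norm] = normed[needs_norm]
    return out
```

Because normalisation happens entirely at ranking time on precomputed similarity matrices, no retraining of the joint embedding is needed, matching the claim above.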
## Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Text-to-Video Retrieval | DiDeMo (test) | R@1: 43.3 | 376 |
| Text-to-Video Retrieval | DiDeMo | R@1: 0.435 | 360 |
| Text-to-Video Retrieval | LSMDC (test) | R@1: 22.4 | 225 |
| Text-to-Video Retrieval | MSVD | R@1: 47.6 | 218 |
| Text-to-Video Retrieval | MSR-VTT (1k-A) | R@10: 83 | 211 |
| Text-to-Video Retrieval | MSVD (test) | R@1: 48 | 204 |
| Text-to-Video Retrieval | MSRVTT | R@1: 33.3 | 98 |
| Text-to-Video Retrieval | VATEX (test) | R@1: 58.8 | 62 |
| Text-to-Video Retrieval | MSR-VTT 1K (val) | R@1: 47.2 | 38 |
| Text-Video Retrieval | MSRVTT Ret (test) | R@1: 47.2 | 14 |