Cross Modal Retrieval with Querybank Normalisation

About

Profiting from large-scale training datasets, advances in neural architecture design and efficient inference, joint embeddings have become the dominant approach for tackling cross-modal retrieval. In this work we first show that, despite their effectiveness, state-of-the-art joint embeddings suffer significantly from the longstanding "hubness problem", in which a small number of gallery embeddings form the nearest neighbours of many queries. Drawing inspiration from the NLP literature, we formulate a simple but effective framework called Querybank Normalisation (QB-Norm) that re-normalises query similarities to account for hubs in the embedding space. QB-Norm improves retrieval performance without requiring retraining. Unlike prior work, we show that QB-Norm works effectively without concurrent access to any test set queries. Within the QB-Norm framework, we also propose a novel similarity normalisation method, the Dynamic Inverted Softmax, that is significantly more robust than existing approaches. We showcase QB-Norm across a range of cross-modal retrieval models and benchmarks where it consistently enhances strong baselines beyond the state of the art. Code is available at https://vladbogo.github.io/QB-Norm/.

Simion-Vlad Bogolin, Ioana Croitoru, Hailin Jin, Yang Liu, Samuel Albanie • 2021
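
The abstract describes re-normalising a query's similarities against a querybank, but gives no formula. The Python below is a minimal sketch of one way this could look, not the authors' implementation. It assumes an inverted-softmax form, in which each gallery item's score is divided by the sum of its exponentiated similarities to the querybank queries, and a "dynamic" variant that applies the normalisation only when the query's top match is a gallery item that some querybank query also ranks in its top k. The function names, the temperature `beta` and the cutoff `k` are illustrative assumptions; the released code at the URL in the abstract is the authoritative reference.

```python
import math


def inverted_softmax(query_sims, bank_sims, beta=20.0):
    """Re-normalise one query's similarities using a querybank (assumed form).

    query_sims: similarities between the test query and each gallery item.
    bank_sims:  one row per querybank query, each row holding that query's
                similarities to the same gallery items.
    """
    normalised = []
    for j, s in enumerate(query_sims):
        # A hub is close to many querybank queries, so its denominator is
        # large and its normalised score is pushed down.
        denom = sum(math.exp(beta * row[j]) for row in bank_sims)
        normalised.append(math.exp(beta * s) / denom)
    return normalised


def hub_candidates(bank_sims, k=1):
    """Gallery indices that appear in the top-k of at least one bank query."""
    active = set()
    for row in bank_sims:
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        active.update(ranked[:k])
    return active


def dynamic_inverted_softmax(query_sims, bank_sims, beta=20.0, k=1):
    """Normalise only when the query's top match is a hub candidate."""
    top = max(range(len(query_sims)), key=lambda j: query_sims[j])
    if top in hub_candidates(bank_sims, k):
        return inverted_softmax(query_sims, bank_sims, beta)
    # Otherwise leave the raw similarities untouched.
    return list(query_sims)


# Toy data (made up for illustration): gallery item 0 behaves like a hub,
# being the nearest neighbour of every querybank query.
bank = [[0.9, 0.2, 0.1], [0.8, 0.3, 0.2], [0.85, 0.1, 0.3]]
query = [0.6, 0.55, 0.1]
scores = dynamic_inverted_softmax(query, bank)
best = max(range(len(scores)), key=lambda j: scores[j])
```

In this toy example the raw similarities rank gallery item 0 first, while the normalised scores rank item 1 first, because item 0 is penalised for being close to every querybank query. Normalised and raw scores are on different scales, which is harmless here because ranking is done separately for each query.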

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Text-to-Video Retrieval | DiDeMo (test) | R@1 43.3 | 376 |
| Text-to-Video Retrieval | DiDeMo | R@1 0.435 | 360 |
| Text-to-Video Retrieval | LSMDC (test) | R@1 22.4 | 225 |
| Text-to-Video Retrieval | MSVD | R@1 47.6 | 218 |
| Text-to-Video Retrieval | MSR-VTT (1k-A) | R@10 83 | 211 |
| Text-to-Video Retrieval | MSVD (test) | R@1 48 | 204 |
| Text-to-Video Retrieval | MSRVTT | R@1 33.3 | 98 |
| Text-to-Video Retrieval | VATEX (test) | R@1 58.8 | 62 |
| Text-to-Video Retrieval | MSR-VTT 1K (val) | R@1 47.2 | 38 |
| Text-Video Retrieval | MSRVTT Ret (test) | R@1 47.2 | 14 |
