
LightningDOT: Pre-training Visual-Semantic Embeddings for Real-Time Image-Text Retrieval

About

Multimodal pre-training has propelled great advancement in vision-and-language (V+L) research. These large-scale pre-trained models, although successful, inevitably suffer from slow inference because of the enormous computation cost, which comes mainly from cross-modal attention in the Transformer architecture. In real-life applications, such latency and computation demand severely deter the practical use of pre-trained models. In this paper, we study image-text retrieval (ITR), the most mature scenario of V+L application, which has been widely studied even prior to the emergence of recent pre-trained models. We propose a simple yet highly effective approach, LightningDOT, that accelerates the inference time of ITR by thousands of times without sacrificing accuracy. LightningDOT removes the time-consuming cross-modal attention by pre-training on three novel learning objectives, extracting feature indexes offline, and employing instant dot-product matching with further re-ranking, which significantly speeds up the retrieval process. In fact, LightningDOT achieves a new state of the art across multiple ITR benchmarks such as Flickr30k, COCO and Multi30K, outperforming existing pre-trained models that consume on the order of 1000x the computational hours. Code and pre-training checkpoints are available at https://github.com/intersun/LightningDOT.
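
The retrieval flow described in the abstract (an offline feature index, dot-product matching, then re-ranking of a short candidate list) can be illustrated with a minimal sketch. This is not the authors' implementation: the toy vectors, the function names `retrieve` and `rerank`, and the `slow_scores` lookup standing in for a more expensive re-ranking model are all assumptions made for illustration.

```python
from typing import Callable, List, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Plain dot product between two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def retrieve(query: Vector, index: List[Vector], top_m: int) -> List[int]:
    """Score every pre-computed index vector against the query by dot product
    and return the ids of the top_m highest-scoring entries."""
    scores = [dot(query, v) for v in index]
    order = sorted(range(len(index)), key=lambda i: scores[i], reverse=True)
    return order[:top_m]


def rerank(candidates: List[int], slow_score: Callable[[int], float]) -> List[int]:
    """Re-order a short candidate list with a more expensive scorer."""
    return sorted(candidates, key=slow_score, reverse=True)


# Hypothetical image embeddings, assumed to be extracted once, offline.
image_index = [[1.0, 0.0], [0.6, 0.8], [0.0, 1.0]]

# Hypothetical embedding of a text query, computed at query time.
text_query = [0.8, 0.6]

# Stage 1: fast dot-product matching over the whole index.
candidates = retrieve(text_query, image_index, top_m=2)

# Stage 2: re-rank only the shortlist. The lookup table below is a stand-in
# (assumption) for a slower, more accurate model's scores.
slow_scores = {0: 0.9, 1: 0.7, 2: 0.1}
reranked = rerank(candidates, lambda i: slow_scores[i])

print(candidates, reranked)
```

The design point the sketch tries to convey is that the expensive scorer only ever sees `top_m` candidates, so its cost does not grow with the size of the index.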

Siqi Sun, Yen-Chun Chen, Linjie Li, Shuohang Wang, Yuwei Fang, Jingjing Liu • 2021

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Image-to-Text Retrieval | Flickr30K 1K (test) | R@1: 83.9 | 439 |
| Text-to-Image Retrieval | Flickr30k (test) | Recall@1: 69.9 | 423 |
| Text-to-Image Retrieval | Flickr30K 1K (test) | R@1: 69.9 | 375 |
| Image-to-Text Retrieval | Flickr30k (test) | R@1: 83.9 | 370 |
| Image-to-Text Retrieval | MS-COCO 5K (test) | R@1: 60.1 | 299 |
| Text-to-Image Retrieval | MSCOCO 5K (test) | R@1: 45.8 | 286 |
| Image-to-Text Retrieval | Flickr30K 1K Karpathy (test) | R@1: 83.9 | 59 |
| Image-to-Text Retrieval | COCO-CN | -- | 48 |
| Image-to-Text Retrieval | MSCOCO 5K (test) | R@1: 60.1 | 46 |
| Image-Text Retrieval | Flickr30k (test) | -- | 21 |
Showing 10 of 14 rows
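
For readers unfamiliar with the R@1 / Recall@1 metric used in the results, the following is a minimal sketch of how Recall@K is commonly computed: the share of queries for which at least one correct item appears among the top K retrieved results. The function name `recall_at_k` and the toy rankings are assumptions for illustration, and reporting the value as a percentage is assumed to match how the table's numbers are scaled.

```python
from typing import List, Set


def recall_at_k(rankings: List[List[int]], relevant: List[Set[int]], k: int) -> float:
    """Percentage of queries with at least one relevant item in the top k.

    rankings[q] is the ranked list of retrieved item ids for query q;
    relevant[q] is the set of ground-truth item ids for query q.
    """
    hits = sum(
        1 for ranked, rel in zip(rankings, relevant)
        if rel.intersection(ranked[:k])
    )
    return 100.0 * hits / len(rankings)


# Toy example: four queries, three candidate items each (hypothetical data).
rankings = [[2, 0, 1], [1, 2, 0], [0, 1, 2], [1, 0, 2]]
relevant = [{2}, {0}, {0}, {1}]

r_at_1 = recall_at_k(rankings, relevant, k=1)
r_at_3 = recall_at_k(rankings, relevant, k=3)
print(r_at_1, r_at_3)
```

Using a set of relevant ids per query covers the case where one query has several correct matches, as in image-to-text retrieval on datasets with multiple captions per image.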
