
Efficient Discriminative Joint Encoders for Large Scale Vision-Language Reranking

About

Multimodal retrieval still leans on embedding-based models like CLIP for fast vector search over pre-computed image embeddings. Yet, unlike text retrieval, where joint-encoder rerankers are standard, comparable vision-language rerankers are largely absent. We find that seminal joint encoders such as BLIP are severely bottlenecked by an expensive visual feature-extraction stage, preventing practical deployment at scale. Motivated by this bottleneck, we introduce EDJE, an Efficient Discriminative Joint Encoder that precomputes vision tokens offline and compresses them via a lightweight attention-based adapter, so online inference runs only a compact joint encoder over a small set of visual tokens plus the text. EDJE preserves strong retrieval performance while drastically reducing storage and online compute, enabling high-throughput inference. Specifically, EDJE processes 50k image-text pairs per second while requiring 49 kB of disk storage per image, matching prior art on Flickr (zero-shot) and COCO (fine-tuned) retrieval.
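The compression step described above can be sketched as a single cross-attention pooling operation: a small set of learned queries attends over the full grid of precomputed vision tokens, producing the compact token set that is stored and fed to the joint encoder online. This is a minimal NumPy sketch under stated assumptions; the token counts, dimensions, and names (`compress_tokens`, `Wk`, `Wv`) are illustrative, not EDJE's actual configuration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def compress_tokens(vision_tokens, queries, Wk, Wv):
    """Cross-attention pooling: k learned queries attend over N precomputed
    vision tokens, yielding k compressed tokens stored in place of all N."""
    K = vision_tokens @ Wk                                # (N, d) keys
    V = vision_tokens @ Wv                                # (N, d) values
    scores = queries @ K.T / np.sqrt(queries.shape[-1])   # (k, N) attention logits
    return softmax(scores, axis=-1) @ V                   # (k, d) compressed tokens

rng = np.random.default_rng(0)
d, N, k = 64, 576, 32                       # hypothetical dims: 576 patch tokens -> 32
vision_tokens = rng.normal(size=(N, d))     # stand-in for offline ViT patch tokens
queries = rng.normal(size=(k, d))           # learned compression queries (assumed)
Wk = rng.normal(size=(d, d))
Wv = rng.normal(size=(d, d))

compressed = compress_tokens(vision_tokens, queries, Wk, Wv)
print(compressed.shape)  # (32, 64)
```

Because only the `k` compressed tokens are written to disk, per-image storage scales with `k * d` rather than with the full patch grid, which is what makes the reported per-image footprint and online throughput possible.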

Mitchell Keren Taraday, Shahaf Wagner, Chaim Baskin • 2025

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Text Retrieval | Flickr30K zero-shot (test) | R@1 | 87.8 | 30 |
| Image-to-Text Retrieval | MS-COCO fine-tuned | R@1 | 81 | 11 |
| Text-to-Image Retrieval | MS-COCO fine-tuned | R@1 | 64.9 | 11 |
| Image-to-Text Retrieval | Flickr30K zero-shot | R@1 | 96.5 | 8 |
| Image-to-Text Retrieval | COCO Full (train+val) | R@5 | 69.86 | 2 |
| Image-to-Text Retrieval | Flickr (full) | R@5 | 92.4 | 2 |
| Text-to-Image Retrieval | COCO Full (train+val) | R@5 | 52.23 | 2 |
| Text-to-Image Retrieval | Flickr (full) | R@5 | 78.32 | 2 |
