Efficient Discriminative Joint Encoders for Large Scale Vision-Language Reranking
About
Multimodal retrieval still leans on embedding-based models like CLIP for fast vector search over pre-computed image embeddings. Yet, unlike text retrieval, where joint-encoder rerankers are standard, comparable vision-language rerankers are largely absent. We find that seminal joint encoders such as BLIP are severely bottlenecked by an expensive visual feature-extraction stage, which prevents practical deployment at scale. Motivated by this bottleneck, we introduce EDJE, an Efficient Discriminative Joint Encoder that precomputes vision tokens offline and compresses them via a lightweight attention-based adapter, so online inference runs only a compact joint encoder over a small set of visual tokens plus the text. EDJE preserves strong retrieval performance while drastically reducing storage and online compute, enabling high-throughput inference. Specifically, EDJE processes 50k image-text pairs per second while requiring only 49kB of disk storage per image, matching prior art on Flickr (zero-shot) and COCO (fine-tuned) retrieval.
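The compression step described above can be sketched as cross-attention pooling: a small set of learned query tokens attends over the many precomputed vision tokens, producing a compact token set for the online joint encoder. The sketch below is a minimal single-head NumPy illustration, not EDJE's actual adapter; the token counts, dimensions, and weight shapes are assumptions chosen for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def compress_tokens(vision_tokens, queries, Wk, Wv):
    """Single-head cross-attention pooling (illustrative sketch).

    K learned queries attend over N precomputed vision tokens,
    yielding K compressed tokens that stand in for the full set.
    """
    d = queries.shape[-1]
    keys = vision_tokens @ Wk                       # (N, d)
    values = vision_tokens @ Wv                     # (N, d)
    attn = softmax(queries @ keys.T / np.sqrt(d))   # (K, N), rows sum to 1
    return attn @ values                            # (K, d)

rng = np.random.default_rng(0)
N, K, d = 576, 32, 64                # hypothetical sizes, not from the paper
vision_tokens = rng.normal(size=(N, d))   # precomputed offline, loaded from disk
queries = rng.normal(size=(K, d))         # learned query tokens
Wk = rng.normal(size=(d, d))
Wv = rng.normal(size=(d, d))

compressed = compress_tokens(vision_tokens, queries, Wk, Wv)
print(compressed.shape)  # (32, 64)
```

The point of the design is that the expensive part (producing the N vision tokens) happens once offline, while online scoring only runs the joint encoder over K compressed tokens plus the query text.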
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Text Retrieval | Flickr30k Zero-shot (test) | Recall@1 | 87.8 | 30 |
| Image-to-Text Retrieval | MS-COCO fine-tuned | Recall@1 | 81 | 11 |
| Text-to-Image Retrieval | MS-COCO fine-tuned | Recall@1 | 64.9 | 11 |
| Image-to-Text Retrieval | Flickr30K zero-shot | Recall@1 | 96.5 | 8 |
| Image-to-Text Retrieval | COCO Full (train+val) | Recall@5 | 69.86 | 2 |
| Image-to-Text Retrieval | Flickr (full) | Recall@5 | 92.4 | 2 |
| Text-to-Image Retrieval | COCO Full (train+val) | Recall@5 | 52.23 | 2 |
| Text-to-Image Retrieval | Flickr (full) | Recall@5 | 78.32 | 2 |