Efficient Discriminative Joint Encoders for Large Scale Vision-Language Reranking
About
Multimodal retrieval still leans on embedding-based models like CLIP for fast vector search over pre-computed image embeddings. Yet, unlike text retrieval, where joint-encoder rerankers are standard, comparable vision-language rerankers are largely absent. We find that seminal joint encoders such as BLIP are severely bottlenecked by an expensive visual feature-extraction stage, which prevents practical deployment at scale. Motivated by this bottleneck, we introduce EDJE, an Efficient Discriminative Joint Encoder that precomputes vision tokens offline and compresses them via a lightweight attention-based adapter, so online inference runs only a compact joint encoder over a small set of visual tokens plus the text. EDJE preserves strong retrieval performance while drastically reducing storage and online compute, enabling high-throughput inference. Specifically, EDJE processes 50k image-text pairs per second while requiring only 49kB of disk storage per image, matching prior art on Flickr (zero-shot) and COCO (fine-tuned) retrieval.
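The compression step described above can be sketched as cross-attention pooling: a small set of learned query tokens attends over the many precomputed vision tokens, producing a compact token set for the online joint encoder. The sketch below is a minimal single-head NumPy illustration, not EDJE's actual adapter; the token counts, dimensions, and weight shapes are assumptions chosen for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def compress_tokens(vision_tokens, queries, Wk, Wv):
    """Single-head cross-attention pooling (illustrative sketch).

    K learned queries attend over N precomputed vision tokens,
    yielding K compressed tokens that stand in for the full set.
    """
    d = queries.shape[-1]
    keys = vision_tokens @ Wk                       # (N, d)
    values = vision_tokens @ Wv                     # (N, d)
    attn = softmax(queries @ keys.T / np.sqrt(d))   # (K, N), rows sum to 1
    return attn @ values                            # (K, d)

rng = np.random.default_rng(0)
N, K, d = 576, 32, 64                # hypothetical sizes, not from the paper
vision_tokens = rng.normal(size=(N, d))   # precomputed offline, loaded from disk
queries = rng.normal(size=(K, d))         # learned query tokens
Wk = rng.normal(size=(d, d))
Wv = rng.normal(size=(d, d))

compressed = compress_tokens(vision_tokens, queries, Wk, Wv)
print(compressed.shape)  # (32, 64)
```

The point of the design is that the expensive part (producing the N vision tokens) happens once offline, while online scoring only runs the joint encoder over K compressed tokens plus the query text.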
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Text Retrieval | Flickr30k Zero-shot (test) | Recall@1 | 87.8 | 30 |
| Image-to-Text Retrieval | MS-COCO fine-tuned | Recall@1 | 81 | 11 |
| Text-to-Image Retrieval | MS-COCO fine-tuned | Recall@1 | 64.9 | 11 |
| Image-to-Text Retrieval | Flickr30K zero-shot | Recall@1 | 96.5 | 8 |
| Image-to-Text Retrieval | COCO Full (train+val) | Recall@5 | 69.86 | 2 |
| Image-to-Text Retrieval | Flickr (full) | Recall@5 | 92.4 | 2 |
| Text-to-Image Retrieval | COCO Full (train+val) | Recall@5 | 52.23 | 2 |
| Text-to-Image Retrieval | Flickr (full) | Recall@5 | 78.32 | 2 |