V-Retrver: Evidence-Driven Agentic Reasoning for Universal Multimodal Retrieval
About
Multimodal Large Language Models (MLLMs) have recently been applied to universal multimodal retrieval, where Chain-of-Thought (CoT) reasoning improves candidate reranking. However, existing approaches remain largely language-driven: they rely on static visual encodings and cannot actively verify fine-grained visual evidence, which often leads to speculative reasoning in visually ambiguous cases. We propose V-Retrver, an evidence-driven retrieval framework that reformulates multimodal retrieval as an agentic reasoning process grounded in visual inspection. V-Retrver enables an MLLM to selectively acquire visual evidence during reasoning via external visual tools, performing a multimodal interleaved reasoning process that alternates between hypothesis generation and targeted visual verification. To train such an evidence-gathering retrieval agent, we adopt a curriculum-based learning strategy combining supervised reasoning activation, rejection-based refinement, and reinforcement learning with an evidence-aligned objective. Experiments across multiple multimodal retrieval benchmarks demonstrate consistent improvements in retrieval accuracy (23.0% on average), perception-driven reasoning reliability, and generalization.
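The alternation between hypothesis generation and targeted visual verification can be illustrated with a toy reranking loop. This is only a minimal sketch under stated assumptions, not the authors' implementation: `Candidate`, `rerank_with_verification`, the `inspect` callback (standing in for an external visual tool), the score-averaging rule, and the `margin` and `max_tool_calls` parameters are all hypothetical names and choices introduced for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Candidate:
    cid: str
    coarse_score: float  # score from a static encoding (assumed to be given)
    evidence: List[float] = field(default_factory=list)  # scores gathered via tool calls

    @property
    def score(self) -> float:
        # Hypothetical fusion rule: average the coarse score with gathered evidence.
        values = [self.coarse_score] + self.evidence
        return sum(values) / len(values)


def rerank_with_verification(
    candidates: List[Candidate],
    inspect: Callable[[str], float],  # stand-in for an external visual tool
    margin: float = 0.1,
    max_tool_calls: int = 4,
) -> Tuple[List[str], int]:
    """Rerank candidates, calling the tool only while the top two are ambiguous."""
    calls = 0
    while True:
        # Hypothesis step: rank by the current (coarse + evidence) score.
        ranked = sorted(candidates, key=lambda c: c.score, reverse=True)
        if len(ranked) < 2 or calls >= max_tool_calls:
            break
        top, runner_up = ranked[0], ranked[1]
        if top.score - runner_up.score >= margin:
            break  # confident enough, no further inspection needed
        # Verification step: inspect whichever of the two has less evidence so far.
        target = min((top, runner_up), key=lambda c: len(c.evidence))
        target.evidence.append(inspect(target.cid))
        calls += 1
    return [c.cid for c in ranked], calls
```

In this sketch the tool is invoked only when the top two candidates are closer than `margin`, which mirrors the idea of acquiring visual evidence selectively rather than for every candidate.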
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Composed Image Retrieval | CIRCO | mAP@5 | 48.2 | 63 |
| Visual Question Answering | InfoSeek (test) | Accuracy | 31.9 | 60 |
| Visual Question Answering | E-VQA (test) | Accuracy | 58 | 56 |
| Multi-modal Retrieval | M-BEIR (test) | Average Recall | 69.7 | 36 |
| Image-text-to-text retrieval | InfoSeek | Recall@5 | 70.3 | 20 |
| Multimodal Retrieval | MT-FIQ | Recall@5 | 68.3 | 15 |
| Image-text-to-multimodal retrieval | OVEN | R@5 | 75.3 | 14 |
| Image-text-to-text retrieval | OVEN | Recall@5 | 57.8 | 14 |
| Visual Question Answering | OKVQA (test) | Accuracy | 65.7 | 11 |
| Contextual Image Retrieval | VIST | R@1 | 3.12e+3 | 10 |