
V-Retrver: Evidence-Driven Agentic Reasoning for Universal Multimodal Retrieval

About

Multimodal Large Language Models (MLLMs) have recently been applied to universal multimodal retrieval, where Chain-of-Thought (CoT) reasoning improves candidate reranking. However, existing approaches remain largely language-driven: they rely on static visual encodings and cannot actively verify fine-grained visual evidence, which often leads to speculative reasoning in visually ambiguous cases. We propose V-Retrver, an evidence-driven retrieval framework that reformulates multimodal retrieval as an agentic reasoning process grounded in visual inspection. V-Retrver enables an MLLM to selectively acquire visual evidence during reasoning via external visual tools, performing a multimodal interleaved reasoning process that alternates between hypothesis generation and targeted visual verification. To train such an evidence-gathering retrieval agent, we adopt a curriculum-based learning strategy combining supervised reasoning activation, rejection-based refinement, and reinforcement learning with an evidence-aligned objective. Experiments across multiple multimodal retrieval benchmarks demonstrate consistent improvements in retrieval accuracy (23.0% on average), perception-driven reasoning reliability, and generalization.

Dongyang Chen, Chaoyang Wang, Dezhao Su, Xi Xiao, Zeyu Zhang, Jing Xiong, Qing Li, Yuzhang Shang, Shichao Kan • 2026

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Composed Image Retrieval | CIRCO | mAP@5 | 48.2 | 63 |
| Visual Question Answering | InfoSeek (test) | Accuracy | 31.9 | 60 |
| Visual Question Answering | E-VQA (test) | Accuracy | 58 | 56 |
| Multi-modal Retrieval | M-BEIR (test) | Average Recall | 69.7 | 36 |
| Image-text-to-text retrieval | InfoSeek | Recall@5 | 70.3 | 20 |
| Multimodal Retrieval | MT-FIQ | Recall@5 | 68.3 | 15 |
| Image-text-to-multimodal retrieval | OVEN | R@5 | 75.3 | 14 |
| Image-text-to-text retrieval | OVEN | Recall@5 | 57.8 | 14 |
| Visual Question Answering | OKVQA (test) | Accuracy | 65.7 | 11 |
| Contextual Image Retrieval | VIST | R@1 | 3.12e+3 | 10 |
Showing 10 of 15 rows
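
For readers unfamiliar with the metrics in the table, the following is a minimal sketch of how Recall@K and mAP@K are commonly computed for retrieval. It is not taken from the V-Retrver paper or from any benchmark's official evaluation script. The function names, the hit-rate form of Recall@K, and the normalization of average precision by min(k, number of relevant items) are assumptions made for illustration, and individual benchmarks may define these metrics differently.

```python
from typing import Sequence, Set


def recall_at_k(ranked: Sequence[Sequence[str]],
                relevant: Sequence[Set[str]],
                k: int = 5) -> float:
    """Percentage of queries with at least one relevant item in the top k.

    Assumption: hit-rate style Recall@K, one ranked candidate list per query.
    """
    if len(ranked) != len(relevant) or not ranked:
        raise ValueError("need one non-empty, aligned entry per query")
    hits = sum(
        1 for cands, rel in zip(ranked, relevant)
        if any(c in rel for c in cands[:k])
    )
    return 100.0 * hits / len(ranked)


def average_precision_at_k(cands: Sequence[str],
                           rel: Set[str],
                           k: int = 5) -> float:
    """AP@K for one query, normalized by min(k, number of relevant items).

    Assumption: this normalization is one common convention, not the only one.
    """
    if not rel:
        return 0.0
    found = 0
    precision_sum = 0.0
    for i, cand in enumerate(cands[:k], start=1):
        if cand in rel:
            found += 1
            precision_sum += found / i  # precision at rank i
    return precision_sum / min(k, len(rel))


def map_at_k(ranked: Sequence[Sequence[str]],
             relevant: Sequence[Set[str]],
             k: int = 5) -> float:
    """Mean of AP@K over all queries, reported as a percentage."""
    if len(ranked) != len(relevant) or not ranked:
        raise ValueError("need one non-empty, aligned entry per query")
    total = sum(
        average_precision_at_k(c, r, k) for c, r in zip(ranked, relevant)
    )
    return 100.0 * total / len(ranked)


if __name__ == "__main__":
    # Hypothetical toy data: two queries with three ranked candidates each.
    ranked = [["a", "b", "c"], ["x", "y", "z"]]
    relevant = [{"c"}, {"q"}]
    print(recall_at_k(ranked, relevant, k=3))
    print(map_at_k(ranked, relevant, k=3))
```

Both functions return percentages so the values are on the same scale as the Result column above.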
