
V-Retrver: Evidence-Driven Agentic Reasoning for Universal Multimodal Retrieval

About

Multimodal Large Language Models (MLLMs) have recently been applied to universal multimodal retrieval, where Chain-of-Thought (CoT) reasoning improves candidate reranking. However, existing approaches remain largely language-driven: they rely on static visual encodings and cannot actively verify fine-grained visual evidence, which often leads to speculative reasoning in visually ambiguous cases. We propose V-Retrver, an evidence-driven retrieval framework that reformulates multimodal retrieval as an agentic reasoning process grounded in visual inspection. V-Retrver enables an MLLM to selectively acquire visual evidence during reasoning via external visual tools, performing a multimodal interleaved reasoning process that alternates between hypothesis generation and targeted visual verification. To train such an evidence-gathering retrieval agent, we adopt a curriculum-based learning strategy combining supervised reasoning activation, rejection-based refinement, and reinforcement learning with an evidence-aligned objective. Experiments across multiple multimodal retrieval benchmarks demonstrate consistent improvements in retrieval accuracy (23.0% on average), perception-driven reasoning reliability, and generalization.
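The alternation between hypothesis generation and targeted visual verification described in the abstract can be illustrated with a toy reranking loop. This is only a minimal sketch under stated assumptions, not the authors' implementation: the function name `rerank_with_verification`, the `inspect` callback (standing in for an external visual tool), and the `margin` and `max_steps` parameters are all hypothetical and do not come from the paper.

```python
from typing import Callable, Dict, List


def rerank_with_verification(
    coarse_scores: Dict[str, float],
    inspect: Callable[[str], float],
    margin: float = 0.1,
    max_steps: int = 3,
) -> List[str]:
    """Toy evidence-driven reranking loop (hypothetical, for illustration only).

    coarse_scores: initial candidate scores, e.g. from a static encoder.
    inspect: stand-in for a visual tool; returns a refined score for one candidate.
    margin: score gap above which the current top candidate is accepted.
    max_steps: upper bound on verification rounds.
    """
    scores = dict(coarse_scores)
    inspected = set()
    for _ in range(max_steps):
        ranked = sorted(scores, key=scores.get, reverse=True)
        if len(ranked) < 2:
            break
        top, runner_up = ranked[0], ranked[1]
        # Hypothesis: `top` is the best match. Accept it if the gap is clear.
        if scores[top] - scores[runner_up] >= margin:
            break
        # Ambiguous case: gather evidence only for the two contested candidates.
        pending = [c for c in (top, runner_up) if c not in inspected]
        if not pending:
            break
        for candidate in pending:
            scores[candidate] = inspect(candidate)
            inspected.add(candidate)
    return sorted(scores, key=scores.get, reverse=True)
```

The point of the sketch is the selectivity: the tool is called only when the top two candidates are too close to separate, which mirrors the idea of acquiring visual evidence on demand rather than for every candidate.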

Dongyang Chen, Chaoyang Wang, Dezhao Su, Xi Xiao, Zeyu Zhang, Jing Xiong, Qing Li, Yuzhang Shang, Shichao Kan • 2026

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Visual Question Answering | E-VQA (test) | Accuracy | 58 | 85 |
| Visual Question Answering | InfoSeek (test) | Accuracy | 31.9 | 81 |
| Composed Image Retrieval | CIRCO | mAP@5 | 48.2 | 76 |
| Multi-modal Retrieval | M-BEIR (test) | Average Recall | 69.7 | 45 |
| Image-text-to-text retrieval | InfoSeek | Recall@5 | 70.3 | 20 |
| Multimodal Retrieval | MT-FIQ | Recall@5 | 68.3 | 15 |
| Image-text-to-multimodal retrieval | OVEN | R@5 | 75.3 | 14 |
| Image-text-to-text retrieval | OVEN | Recall@5 | 57.8 | 14 |
| Visual Question Answering | OKVQA (test) | Accuracy | 65.7 | 11 |
| Contextual Image Retrieval | VIST | R@1 | 3.12e+3 | 10 |
Showing 10 of 15 rows
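
For readers unfamiliar with the retrieval metrics listed above, the sketch below shows one common way Recall@K and average precision at K are computed for a single query. It is an assumption-laden illustration, not the evaluation code behind these numbers: the function names are hypothetical, the ranked list is assumed to contain no duplicates, and benchmarks differ in details such as how AP@K is normalized (here by `min(len(relevant), k)`). mAP@K is the mean of the per-query AP@K values.

```python
from typing import Sequence, Set


def recall_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant items that appear in the top-k results."""
    if not relevant:
        return 0.0
    hits = sum(1 for item in ranked[:k] if item in relevant)
    return hits / len(relevant)


def average_precision_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Average of precision values at each rank (up to k) where a relevant item occurs."""
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, item in enumerate(ranked[:k], start=1):
        if item in relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    return precision_sum / min(len(relevant), k)


# Made-up example: two relevant items, found at ranks 2 and 4.
ranked = ["d3", "d1", "d7", "d2", "d9"]
relevant = {"d1", "d2"}
print(recall_at_k(ranked, relevant, 5))             # 1.0
print(average_precision_at_k(ranked, relevant, 5))  # 0.5
```

When each query has exactly one ground-truth item, Recall@K reduces to a hit rate: 1 if the item is in the top K, 0 otherwise.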
