Utility-Oriented Visual Evidence Selection for Multimodal Retrieval-Augmented Generation

About

Visual evidence selection is a critical component of multimodal retrieval-augmented generation (RAG), yet existing methods typically rely on semantic relevance or surface-level similarity, which are often misaligned with the actual utility of visual evidence for downstream reasoning. We reformulate multimodal evidence selection from an information-theoretic perspective by defining evidence utility as the information gain induced on a model's output distribution. To overcome the intractability of answer-space optimization, we introduce a latent notion of evidence helpfulness and theoretically show that, under mild assumptions, ranking evidence by information gain on this latent variable is equivalent to ranking it by answer-space utility. We further propose a training-free, surrogate-accelerated framework that efficiently estimates evidence utility using lightweight multimodal models. Experiments on MRAG-Bench and Visual-RAG across multiple model families demonstrate that our method consistently outperforms state-of-the-art RAG baselines while achieving substantial reductions in computational cost.
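As a rough illustration of ranking evidence by information gain, the sketch below scores each candidate by how much it reduces the entropy of a distribution over a binary latent "helpful / not helpful" variable. The abstract does not give the estimator, so entropy reduction is an assumed stand-in, and the function names, the candidate names (img_a, img_b, img_c) and the probabilities are all hypothetical. In practice the distributions would come from a lightweight surrogate multimodal model rather than being hard-coded.

```python
import math
from typing import Dict, List, Sequence, Tuple


def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy in nats; zero-probability terms contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def information_gain(prior: Sequence[float], posterior: Sequence[float]) -> float:
    """Assumed utility: entropy reduction H(prior) - H(posterior)."""
    return entropy(prior) - entropy(posterior)


def rank_evidence(
    prior: Sequence[float],
    posteriors: Dict[str, Sequence[float]],
    top_k: int,
) -> List[Tuple[str, float]]:
    """Score every candidate and return the top_k by descending gain."""
    scored = [
        (name, information_gain(prior, post))
        for name, post in posteriors.items()
    ]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]


# Hypothetical toy numbers: distribution over (helpful, not helpful)
# before seeing any image, and after conditioning on each candidate.
prior = [0.5, 0.5]
posteriors = {
    "img_a": [0.9, 0.1],    # sharpens the distribution a lot
    "img_b": [0.55, 0.45],  # barely changes it
    "img_c": [0.5, 0.5],    # changes nothing
}

ranking = rank_evidence(prior, posteriors, top_k=2)
for name, gain in ranking:
    print(f"{name}: gain = {gain:.3f} nats")
```

With these toy numbers, img_a ranks first and img_b second, while img_c contributes zero gain and is dropped. A KL-divergence score between posterior and prior would be an equally plausible reading of "information gain"; the choice here is for simplicity only.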

Weiqing Luo, Zongye Hu, Xiao Wang, Zhiyuan Yu, Haofeng Zhang, Ziyi Huang • 2026

Related benchmarks

Task	Dataset	Metric	Result	Rank
Visual Retrieval-Augmented Generation	Visual-RAG	Score	65.11	39
Multi-modal Retrieval-Augmented Generation	MRAG-Bench	Accuracy	61.2	9
