AutoV: Loss-Oriented Ranking for Visual Prompt Retrieval in LVLMs

About

Inspired by text prompts in large language models, visual prompts have been explored to enhance the perceptual capabilities of large vision-language models (LVLMs). However, performance tends to saturate under any single visual prompt design, making further prompt engineering increasingly ineffective. To address this limitation, we shift from prompt engineering to prompt retrieval and propose AutoV, a lightweight framework for instance-adaptive visual prompt identification. Given an input image and a textual query, AutoV automatically selects the most suitable visual prompt from a diverse candidate pool. Training such a retrieval framework requires prompt-level supervision, yet prompt quality is inherently ambiguous and difficult to assess reliably, even for humans. To enable automatic supervision, we evaluate visual prompts using a pre-trained LVLM and label them according to their prediction losses. Using this loss-oriented ranking as a robust training signal, AutoV learns to retrieve the query-aware optimal prompt for each instance without manual annotation. Experiments indicate that AutoV enhances the performance of various LVLMs on image understanding, captioning, grounding, and classification tasks. For example, AutoV improves LLaVA-OV by 10.2% on VizWiz and boosts Qwen2.5-VL by 3.8% on MMMU.
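
The loss-oriented supervision described above can be illustrated with a minimal sketch. It assumes each candidate visual prompt has already been scored by a pre-trained LVLM, so only a list of prediction losses is needed. The function names, the temperature parameter, and the listwise cross-entropy objective are illustrative assumptions, not the paper's confirmed training objective.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def rank_prompts_by_loss(losses):
    """Return candidate indices ordered from lowest to highest LVLM loss."""
    return sorted(range(len(losses)), key=lambda i: losses[i])


def soft_targets_from_losses(losses, temperature=1.0):
    """Map losses to a target distribution: lower loss gets higher probability.

    The temperature is a hypothetical knob controlling how peaked the targets are.
    """
    return softmax([-loss / temperature for loss in losses])


def listwise_ranking_loss(scores, losses, temperature=1.0):
    """Cross-entropy between loss-derived targets and the retriever's scores.

    This listwise form is an assumed stand-in for the ranking signal.
    """
    targets = soft_targets_from_losses(losses, temperature)
    probs = softmax(scores)
    return -sum(t * math.log(p) for t, p in zip(targets, probs))


def retrieve(scores):
    """Pick the candidate prompt with the highest retriever score."""
    return max(range(len(scores)), key=lambda i: scores[i])


# Hypothetical losses of a frozen LVLM for three candidate visual prompts.
lvlm_losses = [1.9, 0.7, 1.2]
# Hypothetical retriever scores for the same three candidates.
retriever_scores = [0.1, 2.0, 0.5]

ranking = rank_prompts_by_loss(lvlm_losses)
best = retrieve(retriever_scores)
loss_value = listwise_ranking_loss(retriever_scores, lvlm_losses)
```

In this sketch the training signal comes entirely from the LVLM's losses, so no human labels of prompt quality are required. At inference time only `retrieve` would be called, on scores the retriever produces from the image and query.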

Yuan Zhang, Chun-Kai Fan, Sicheng Yu, Junwen Pan, Tao Huang, Ming Lu, Kuan Cheng, Qi She, Shanghang Zhang • 2025

Related benchmarks

Task	Dataset	Result
Visual Question Answering	VizWiz	Accuracy 74.4	1863
Image Classification	ImageNet-1K	Top-1 Acc 97.1	1239
Multimodal Understanding	MMStar	--	511
Multimodal Understanding	MMMU	Accuracy 54	437
Mathematical Reasoning	MathVista	Accuracy 70.5	382
Visual Question Answering	RealworldQA	Accuracy 71.1	327
Visual Perception	BLINK	Accuracy 59	255
Visual Question Answering	VizWiz (test)	Accuracy 71.9	136
Multimodal Visual Perception	MMVP	Accuracy 64.3	106
Text-based Visual Question Answering	VQAText	Accuracy 86.1	89

Showing 10 of 16 rows
