
AutoV: Loss-Oriented Ranking for Visual Prompt Retrieval in LVLMs

About

Inspired by text prompts in large language models, visual prompts have been explored to enhance the perceptual capabilities of large vision-language models (LVLMs). However, performance tends to saturate under single visual prompt designs, making further prompt engineering increasingly ineffective. To address this limitation, we shift from prompt engineering to prompt retrieval and propose AutoV, a lightweight framework for instance-adaptive visual prompt identification. Given an input image and a textual query, AutoV automatically locates the most suitable visual prompt from a diverse candidate pool. Training such a retrieval framework requires prompt-level supervision, yet prompt quality is inherently ambiguous and difficult to assess reliably, even for humans. To enable automatic supervision, we evaluate visual prompts using a pre-trained LVLM and label them according to their prediction losses. Using the loss-oriented ranking as a robust training signal, AutoV learns to retrieve the query-aware optimal prompt for each instance without manual annotation. Experiments indicate that AutoV enhances the performance of various LVLMs on image understanding, captioning, grounding, and classification tasks. For example, AutoV improves LLaVA-OV by 10.2% on VizWiz and boosts Qwen2.5-VL by 3.8% on MMMU.
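The loss-oriented ranking idea in the abstract can be illustrated with a minimal Python sketch. It assumes the per-candidate prediction losses from the pre-trained LVLM are already available as plain numbers. The function names, the temperature parameter, the example values, and the listwise KL objective are all illustrative assumptions; the abstract does not specify the training objective AutoV actually uses.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def rank_by_loss(losses):
    """Return candidate indices ordered from lowest to highest loss."""
    return sorted(range(len(losses)), key=lambda i: losses[i])


def loss_to_target(losses, temperature=1.0):
    """Turn losses into a soft target: lower loss gets higher probability.

    The temperature is a hypothetical knob, not taken from the paper.
    """
    return softmax([-loss / temperature for loss in losses])


def listwise_kl(scores, target):
    """KL(target || softmax(scores)), one possible listwise ranking loss."""
    predicted = softmax(scores)
    return sum(t * math.log(t / p) for t, p in zip(target, predicted) if t > 0)


# Hypothetical LVLM prediction losses for four candidate visual prompts
# applied to one (image, query) pair.
candidate_losses = [2.31, 0.87, 1.45, 0.92]

ranking = rank_by_loss(candidate_losses)
target = loss_to_target(candidate_losses, temperature=0.5)

# Hypothetical scores produced by a retriever for the same candidates.
retriever_scores = [0.1, 1.2, 0.3, 0.9]
training_loss = listwise_kl(retriever_scores, target)

# At inference time, the retriever picks its top-scoring candidate.
best_prompt = max(range(len(retriever_scores)), key=lambda i: retriever_scores[i])
```

In this sketch the LVLM is only needed to produce the losses that supervise training; at inference the retriever scores the candidates on its own, which is consistent with the abstract's description of a lightweight retrieval framework trained without manual annotation.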

Yuan Zhang, Chun-Kai Fan, Sicheng Yu, Junwen Pan, Tao Huang, Ming Lu, Kuan Cheng, Qi She, Shanghang Zhang • 2025

Related benchmarks

| Task | Dataset | Result | Rank |
| Visual Question Answering | VizWiz | Accuracy 74.4 | 1525 |
| Image Classification | ImageNet-1K | Top-1 Acc 97.1 | 1239 |
| Multimodal Understanding | MMMU | Accuracy 54 | 437 |
| Multimodal Understanding | MMStar | -- | 324 |
| Mathematical Reasoning | MathVista | Accuracy 70.5 | 257 |
| Visual Question Answering | RealworldQA | Accuracy 71.1 | 179 |
| Visual Perception | BLINK | Accuracy 59 | 122 |
| Multi-modal Understanding | LLaVA-Bench Wild | LLaVA^W Score 88.4 | 86 |
| Visual Question Answering | VizWiz (test) | Accuracy 71.9 | 79 |
| Multimodal Visual Perception | MMVP | Accuracy 64.3 | 72 |
Showing 10 of 16 rows
