FOCUS: Internal MLLM Representations for Efficient Fine-Grained Visual Question Answering

About

While Multimodal Large Language Models (MLLMs) offer strong perception and reasoning capabilities for image-text input, Visual Question Answering (VQA) focusing on small image details remains a challenge. Although visual cropping techniques seem promising, recent approaches have several limitations: the need for task-specific fine-tuning, low efficiency due to uninformed exhaustive search, or incompatibility with efficient attention implementations. We address these shortcomings by proposing a training-free visual cropping method, dubbed FOCUS, that leverages MLLM-internal representations to guide the search for the most relevant image region. This is accomplished in four steps: first, we identify the target object(s) in the VQA prompt; second, we compute an object relevance map using the key-value (KV) cache; third, we propose and rank relevant image regions based on the map; and finally, we perform the fine-grained VQA task using the top-ranked region. As a result of this informed search strategy, FOCUS achieves strong performance across four fine-grained VQA datasets and three types of MLLMs. It outperforms three popular visual cropping methods in both accuracy and efficiency, and matches the best-performing baseline, ZoomEye, while requiring 3 to 6.5 times less compute.
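
The second and third steps (relevance map, then region proposal and ranking) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the abstract does not say how the map is derived from the KV cache or how regions are proposed, so the sketch assumes cosine similarity between cached image-token keys and one target-token key, and a plain sliding window scored by mean relevance. The function names `relevance_map` and `rank_regions` and all parameters are hypothetical.

```python
import numpy as np

def relevance_map(image_keys, target_key, grid_hw):
    """Assumed scoring: cosine similarity between each image-token key
    (rows of image_keys) and a single target-token key vector."""
    h, w = grid_hw
    keys = image_keys / np.linalg.norm(image_keys, axis=1, keepdims=True)
    tgt = target_key / np.linalg.norm(target_key)
    return (keys @ tgt).reshape(h, w)

def rank_regions(rel_map, win_hw, stride=1, top_k=3):
    """Assumed proposal scheme: slide a fixed-size window over the map and
    rank windows by mean relevance. Returns (score, (row, col, h, w))."""
    H, W = rel_map.shape
    wh, ww = win_hw
    scored = []
    for r in range(0, H - wh + 1, stride):
        for c in range(0, W - ww + 1, stride):
            score = float(rel_map[r:r + wh, c:c + ww].mean())
            scored.append((score, (r, c, wh, ww)))
    scored.sort(key=lambda s: s[0], reverse=True)
    return scored[:top_k]

# Toy data standing in for cached keys: a 4x4 token grid with 8-dim keys.
target = np.zeros(8)
target[0] = 1.0
keys = np.zeros((16, 8))
keys[:, 1] = 1.0                    # background tokens, orthogonal to target
keys[[5, 6, 9, 10]] = 0.0
keys[[5, 6, 9, 10], 0] = 1.0        # central 2x2 block aligned with target

rel = relevance_map(keys, target, (4, 4))
best_score, best_box = rank_regions(rel, (2, 2))[0]
```

In this toy setup the top-ranked window is the central 2x2 block of the token grid; in a real pipeline that box would be mapped back to pixel coordinates and the crop passed to the MLLM for the final answer.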

Liangyu Zhong, Fabio Rosenthal, Joachim Sicking, Fabian Hüger, Thorsten Bagdonat, Hanno Gottschalk, Leo Schwinn • 2025

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Visual Question Answering | GQA | Accuracy | 0.6034 | 963 |
| Fine-grained Visual Question Answering | V*Bench | Overall Accuracy | 92.15 | 28 |
| Fine-grained Visual Question Answering | HRBench 4K | Overall Accuracy | 71.13 | 28 |
| Fine-grained Visual Question Answering | HRBench-8K | Overall Accuracy | 69.63 | 28 |
| Visual Question Answering | V*Bench | Accuracy | 90.58 | 17 |
| Visual Question Answering | HRBench 4K | Accuracy | 0.7925 | 12 |
| Visual Question Answering | HRBench-8K | Accuracy | 76.25 | 12 |
| Perception | MME-RealWorld Lite (test) | OCR | 83.6 | 3 |
| Reasoning | MME-RealWorld Lite (test) | OCR | 71 | 3 |
