FOCUS: Internal MLLM Representations for Efficient Fine-Grained Visual Question Answering
About
While Multimodal Large Language Models (MLLMs) offer strong perception and reasoning capabilities for image-text input, Visual Question Answering (VQA) that focuses on small image details remains a challenge. Although visual cropping techniques seem promising, recent approaches have several limitations: they need task-specific fine-tuning, are inefficient due to uninformed exhaustive search, or are incompatible with efficient attention implementations. We address these shortcomings with FOCUS, a training-free visual cropping method that leverages MLLM-internal representations to guide the search for the most relevant image region. It proceeds in four steps: first, we identify the target object(s) in the VQA prompt; second, we compute an object relevance map from the key-value (KV) cache; third, we propose and rank relevant image regions based on the map; and finally, we perform the fine-grained VQA task on the top-ranked region. Thanks to this informed search strategy, FOCUS achieves strong performance across four fine-grained VQA datasets and three types of MLLMs. It outperforms three popular visual cropping methods in both accuracy and efficiency, and matches the best-performing baseline, ZoomEye, while requiring 3-6.5x less compute.
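The third step (propose and rank image regions from a relevance map) can be illustrated with a minimal toy sketch. This is not the actual FOCUS implementation: the relevance map here is a hand-made NumPy array rather than one derived from a KV cache, and the `propose_and_rank_regions` helper, its window/stride parameters, and the mean-relevance score are illustrative assumptions.

```python
import numpy as np

def propose_and_rank_regions(relevance, window=4, stride=2):
    """Slide a window over a 2D relevance map and rank candidate crops
    by mean relevance (a toy stand-in for the region-ranking step)."""
    H, W = relevance.shape
    candidates = []
    for y in range(0, H - window + 1, stride):
        for x in range(0, W - window + 1, stride):
            score = relevance[y:y + window, x:x + window].mean()
            candidates.append((score, (y, x, y + window, x + window)))
    # Highest mean relevance first; the top entry is the crop to answer on.
    candidates.sort(key=lambda c: c[0], reverse=True)
    return candidates

# Toy relevance map with a "hot" object region at rows/cols 8-11.
rel = np.zeros((16, 16))
rel[8:12, 8:12] = 1.0
best_score, best_box = propose_and_rank_regions(rel)[0]
print(best_box)  # prints (8, 8, 12, 12): the crop covering the hot region
```

In the real method, the top-ranked crop would then be passed back to the MLLM together with the question for the final fine-grained VQA step.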
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Visual Question Answering | GQA | Accuracy | 60.34 | 963 |
| Fine-grained Visual Question Answering | V*Bench | Overall Accuracy | 92.15 | 28 |
| Fine-grained Visual Question Answering | HRBench 4K | Overall Accuracy | 71.13 | 28 |
| Fine-grained Visual Question Answering | HRBench 8K | Overall Accuracy | 69.63 | 28 |
| Visual Question Answering | V*Bench | Accuracy | 90.58 | 17 |
| Visual Question Answering | HRBench 4K | Accuracy | 79.25 | 12 |
| Visual Question Answering | HRBench 8K | Accuracy | 76.25 | 12 |
| Perception | MME-RealWorld Lite (test) | OCR | 83.6 | 3 |
| Reasoning | MME-RealWorld Lite (test) | OCR | 71 | 3 |