FOCUS: Internal MLLM Representations for Efficient Fine-Grained Visual Question Answering
About
While Multimodal Large Language Models (MLLMs) offer strong perception and reasoning capabilities for image-text input, Visual Question Answering (VQA) that focuses on small image details remains a challenge. Although visual cropping techniques seem promising, recent approaches have several limitations: they need task-specific fine-tuning, are inefficient due to uninformed exhaustive search, or are incompatible with efficient attention implementations. We address these shortcomings with FOCUS, a training-free visual cropping method that leverages MLLM-internal representations to guide the search for the most relevant image region. It proceeds in four steps: first, we identify the target object(s) in the VQA prompt; second, we compute an object relevance map from the key-value (KV) cache; third, we propose and rank relevant image regions based on the map; and finally, we perform the fine-grained VQA task on the top-ranked region. Thanks to this informed search strategy, FOCUS achieves strong performance across four fine-grained VQA datasets and three types of MLLMs. It outperforms three popular visual cropping methods in both accuracy and efficiency, and matches the best-performing baseline, ZoomEye, while requiring 3-6.5x less compute.
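The third step (propose and rank image regions from a relevance map) can be illustrated with a minimal toy sketch. This is not the actual FOCUS implementation: the relevance map here is a hand-made NumPy array rather than one derived from a KV cache, and the `propose_and_rank_regions` helper, its window/stride parameters, and the mean-relevance score are illustrative assumptions.

```python
import numpy as np

def propose_and_rank_regions(relevance, window=4, stride=2):
    """Slide a window over a 2D relevance map and rank candidate crops
    by mean relevance (a toy stand-in for the region-ranking step)."""
    H, W = relevance.shape
    candidates = []
    for y in range(0, H - window + 1, stride):
        for x in range(0, W - window + 1, stride):
            score = relevance[y:y + window, x:x + window].mean()
            candidates.append((score, (y, x, y + window, x + window)))
    # Highest mean relevance first; the top entry is the crop to answer on.
    candidates.sort(key=lambda c: c[0], reverse=True)
    return candidates

# Toy relevance map with a "hot" object region at rows/cols 8-11.
rel = np.zeros((16, 16))
rel[8:12, 8:12] = 1.0
best_score, best_box = propose_and_rank_regions(rel)[0]
print(best_box)  # prints (8, 8, 12, 12): the crop covering the hot region
```

In the real method, the top-ranked crop would then be passed back to the MLLM together with the question for the final fine-grained VQA step.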
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Visual Question Answering | GQA | Accuracy | 60.34 | 963 |
| Fine-grained Visual Question Answering | V*Bench | Overall Accuracy | 92.15 | 28 |
| Fine-grained Visual Question Answering | HRBench 4K | Overall Accuracy | 71.13 | 28 |
| Fine-grained Visual Question Answering | HRBench 8K | Overall Accuracy | 69.63 | 28 |
| Visual Question Answering | V*Bench | Accuracy | 90.58 | 17 |
| Visual Question Answering | HRBench 4K | Accuracy | 79.25 | 12 |
| Visual Question Answering | HRBench 8K | Accuracy | 76.25 | 12 |
| Perception | MME-RealWorld Lite (test) | OCR | 83.6 | 3 |
| Reasoning | MME-RealWorld Lite (test) | OCR | 71 | 3 |