LookWise: Knowing When and Where to Look for Fine-Grained Visual Reasoning in Multimodal Large Language Models
About
Multimodal Large Language Models (MLLMs) are shifting towards "Thinking with Images" by actively exploring image details. While effective, large-scale training is computationally expensive, which has spurred growing interest in lightweight, training-free solutions. However, existing training-free methods suffer from two flaws: perceptual redundancy from indiscriminate cropping, which increases computational cost and introduces noise; and a drift between semantic intent and spatial attention, which prevents accurate localization of user-focused regions. To address these challenges, we propose LookWise, a framework for adaptive visual reasoning. LookWise follows a two-stage pipeline: a confidence-based module decides when to look more carefully, and a semantic-guided localization module determines where to look. This design enables MLLMs to adaptively acquire fine-grained visual evidence without additional training. Experiments on fine-grained and high-resolution visual reasoning benchmarks show that LookWise consistently improves accuracy over strong baselines while achieving an approximately $4.0\times$ inference speedup over the search-based method ZoomEye, demonstrating robust cross-model generalization.
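The two-stage control flow described above can be illustrated with a small toy sketch. It is not the authors' implementation. The abstract does not specify the confidence measure, the threshold, or how the region is chosen, so the sketch assumes maximum softmax probability, a threshold of 0.8, and a brute-force search for the fixed-size window with the highest total relevance. `answer_fn` and `relevance_fn` are hypothetical stand-ins for an MLLM call and a question-conditioned relevance map, and the image is modelled as a plain 2D grid.

```python
import math
from typing import Callable, List, Optional, Sequence, Tuple

Grid = List[List[float]]
Box = Tuple[int, int, int, int]  # (row0, col0, row1, col1), end-exclusive


def answer_confidence(logits: Sequence[float]) -> float:
    """Maximum softmax probability over candidate answers (assumed confidence measure)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    return max(exps) / sum(exps)


def argmax(values: Sequence[float]) -> int:
    """Index of the largest value (first one on ties)."""
    return max(range(len(values)), key=lambda i: values[i])


def best_window(relevance: Grid, h: int, w: int) -> Box:
    """Brute-force search for the h x w window with the largest total relevance."""
    rows, cols = len(relevance), len(relevance[0])
    best_score, best_box = float("-inf"), (0, 0, h, w)
    for r in range(rows - h + 1):
        for c in range(cols - w + 1):
            score = sum(relevance[i][j]
                        for i in range(r, r + h)
                        for j in range(c, c + w))
            if score > best_score:
                best_score, best_box = score, (r, c, r + h, c + w)
    return best_box


def adaptive_answer(
    answer_fn: Callable[[Grid, str], Sequence[float]],  # hypothetical model call
    relevance_fn: Callable[[Grid, str], Grid],          # hypothetical relevance map
    image: Grid,
    question: str,
    threshold: float = 0.8,            # assumed value, not taken from the paper
    window: Tuple[int, int] = (2, 2),  # assumed crop size in grid cells
) -> Tuple[int, Optional[Box]]:
    """Answer from the full image; crop and re-ask only when confidence is low."""
    logits = answer_fn(image, question)
    # Stage 1 ("when"): skip the extra look if the first answer is confident.
    if answer_confidence(logits) >= threshold:
        return argmax(logits), None
    # Stage 2 ("where"): pick the most question-relevant region and re-ask on it.
    r0, c0, r1, c1 = best_window(relevance_fn(image, question), *window)
    crop = [row[c0:c1] for row in image[r0:r1]]
    return argmax(answer_fn(crop, question)), (r0, c0, r1, c1)
```

The point of the gate is that confident questions cost a single forward pass, and only uncertain ones pay for localization and a second pass; this is one plausible reading of how a training-free method could avoid indiscriminate cropping.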
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Object Hallucination Evaluation | POPE | Accuracy | 89.12 | 2019 |
| High-Resolution Visual Perception | HR-Bench-4K | Accuracy | 73.25 | 79 |
| High-Resolution Visual Perception | HR-Bench-8K | Accuracy | 70 | 63 |
| Visual Perception Reasoning | V*Bench | Score | 86.38 | 28 |
| Visual Question Answering | AOKVQA | Accuracy | 73.1 | 8 |