ZoomEye: Enhancing Multimodal LLMs with Human-Like Zooming Capabilities through Tree-Based Image Exploration

About

Multimodal Large Language Models (MLLMs) have demonstrated impressive capabilities in vision-language understanding. Recently, with the integration of test-time scaling techniques, these models have also shown strong potential in visual reasoning. However, most existing reasoning approaches remain text-level in nature: MLLMs are prompted to explore various combinations of textual tokens via their underlying language model, while the visual input remains fixed throughout the reasoning process. This paradigm limits the model's ability to fully exploit rich visual information, particularly when dealing with images containing numerous fine-grained elements. In such cases, vision-level reasoning becomes crucial: the model must dynamically zoom into specific regions of the image to gather the detailed visual cues necessary for accurate decision-making. In this paper, we propose Zoom Eye, a training-free, model-agnostic tree-search algorithm tailored for vision-level reasoning. Zoom Eye treats an image as a hierarchical tree structure, where each child node represents a zoomed-in sub-region of its parent and the root corresponds to the full image. The algorithm enables MLLMs to simulate human-like zooming behavior by navigating from the root to leaf nodes in search of task-relevant visual evidence. We evaluate Zoom Eye on a series of high-resolution benchmarks, and the results demonstrate that it consistently improves the performance of multiple MLLMs by a large margin (e.g., InternVL2.5-8B increases by 15.71% and 17.69% on HR-Bench) and also enables small 3-8B MLLMs to outperform strong large models such as GPT-4o. Code: https://github.com/om-ai-lab/ZoomEye
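The tree formulation described above can be sketched in a few lines of Python. Note that this is a minimal illustration under assumptions, not the authors' implementation: the quadtree-style split, the `min_size` stopping rule, the greedy descent, and the `score_fn` callback (a stand-in for an MLLM judging how relevant a cropped region is to the query) are all hypothetical choices made here for clarity.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels

@dataclass
class ZoomNode:
    """A node in the image tree: a rectangular sub-region of the full image."""
    box: Box
    children: List["ZoomNode"] = field(default_factory=list)

def build_zoom_tree(box: Box, min_size: int = 512) -> ZoomNode:
    """Recursively split a region into four quadrants until regions are small
    enough to inspect directly (the root is the full image)."""
    node = ZoomNode(box)
    l, t, r, b = box
    if r - l > min_size and b - t > min_size:
        mx, my = (l + r) // 2, (t + b) // 2
        for child_box in [(l, t, mx, my), (mx, t, r, my),
                          (l, my, mx, b), (mx, my, r, b)]:
            node.children.append(build_zoom_tree(child_box, min_size))
    return node

def search(root: ZoomNode,
           score_fn: Callable[[Box], float],
           threshold: float = 0.8) -> List[ZoomNode]:
    """Greedy root-to-leaf descent: repeatedly zoom into the child the model
    scores highest, stopping once the current view is judged sufficient."""
    node, path = root, [root]
    while node.children:
        if score_fn(node.box) >= threshold:
            break  # current view already contains enough visual evidence
        node = max(node.children, key=lambda c: score_fn(c.box))
        path.append(node)
    return path  # the sequence of zoom steps, coarse to fine
```

In a real system, `score_fn` would crop the region from the image and query the MLLM for a relevance or confidence score, and the final views on the returned path would be fed back to the model to answer the question.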

Haozhan Shen, Kangjia Zhao, Tiancheng Zhao, Ruochen Xu, Zilun Zhang, Mingwei Zhu, Jianwei Yin• 2024

Related benchmarks

Task | Dataset | Metric | Result | Rank
Object Hallucination Evaluation | POPE | Accuracy | 88.94 | 1455
Visual Question Answering | GQA | Accuracy | 62.1 | 505
Visual Question Answering | InfoVQA (val) | Accuracy | 76.2 | 91
Visual Question Answering | V*Bench | Accuracy | 90.6 | 84
High-resolution Visual Understanding | HR-Bench-8K | FSP | 88.5 | 73
Visual Question Answering | HR-Bench-4K | Accuracy | 0.684 | 54
Visual Question Answering | HR-Bench-8K | Accuracy | 66.5 | 51
High-resolution Perception | HR-Bench-4K | Overall Score | 75.5 | 44
Visual Perception and Reasoning | V*Bench | Attribute Score | 92.17 | 41
High-Resolution Visual Perception | HR-Bench-4K | Accuracy | 70.13 | 40

(Showing 10 of 27 rows)
