
Zoom-Refine: Boosting High-Resolution Multimodal Understanding via Localized Zoom and Self-Refinement

About

Multimodal Large Language Models (MLLM) often struggle to interpret high-resolution images accurately, where fine-grained details are crucial for complex visual understanding. We introduce Zoom-Refine, a novel training-free method that enhances MLLM capabilities to address this issue. Zoom-Refine operates through a synergistic process of Localized Zoom and Self-Refinement. In the Localized Zoom step, Zoom-Refine leverages the MLLM to provide a preliminary response to an input query and identifies the most task-relevant image region by predicting its bounding box coordinates. During the Self-Refinement step, Zoom-Refine then integrates fine-grained details from the high-resolution crop (identified by Localized Zoom) with its initial reasoning to re-evaluate and refine its preliminary response. Our method harnesses the MLLM's inherent capabilities for spatial localization, contextual reasoning and comparative analysis without requiring additional training or external experts. Comprehensive experiments demonstrate the efficacy of Zoom-Refine on two challenging high-resolution multimodal benchmarks. Code is available at https://github.com/xavier-yu114/Zoom-Refine
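The two-step flow described in the abstract can be sketched roughly as follows. This is an illustrative outline only, not the authors' implementation: the `MLLM` callable interface, the prompt wording, the `[x1, y1, x2, y2]` pixel-coordinate box format, and the helper names `parse_bbox`, `crop`, and `zoom_refine` are all assumptions made for this example, and the image is a toy nested list rather than a real image object.

```python
import re
from typing import Callable, List, Optional, Sequence, Tuple

Image = List[List[int]]            # toy image: rows of pixel values (assumption)
BBox = Tuple[int, int, int, int]   # (x1, y1, x2, y2) in pixel coordinates (assumption)

# Hypothetical model interface: takes a prompt and a list of images, returns text.
MLLM = Callable[[str, Sequence[Image]], str]

BBOX_PATTERN = re.compile(r"\[\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*\]")


def parse_bbox(text: str, width: int, height: int) -> Optional[BBox]:
    """Extract the first [x1, y1, x2, y2] box from text and clamp it to the image."""
    match = BBOX_PATTERN.search(text)
    if match is None:
        return None
    x1, y1, x2, y2 = (int(g) for g in match.groups())
    x1, x2 = sorted((min(x1, width), min(x2, width)))
    y1, y2 = sorted((min(y1, height), min(y2, height)))
    if x1 == x2 or y1 == y2:
        return None  # degenerate box: nothing to crop
    return x1, y1, x2, y2


def crop(image: Image, box: BBox) -> Image:
    """Return the sub-image covered by box (x2 and y2 are exclusive)."""
    x1, y1, x2, y2 = box
    return [row[x1:x2] for row in image[y1:y2]]


def zoom_refine(model: MLLM, image: Image, question: str) -> str:
    """Assumes a non-empty rectangular image."""
    height, width = len(image), len(image[0])

    # Step 1 (Localized Zoom): preliminary answer plus a predicted bounding box.
    preliminary = model(
        f"{question}\nAnswer, then give the most relevant region "
        "as [x1, y1, x2, y2].",
        [image],
    )
    box = parse_bbox(preliminary, width, height)
    if box is None:
        return preliminary  # no usable box: fall back to the first answer

    # Step 2 (Self-Refinement): re-evaluate using the crop and the first answer.
    return model(
        f"{question}\nYour earlier response: {preliminary}\n"
        "The second image is a crop of the region you selected. "
        "Re-evaluate your earlier response and give a final answer.",
        [image, crop(image, box)],
    )
```

In practice `model` would wrap a real MLLM call and `crop` would operate on the original high-resolution image; the fallback to the preliminary answer when no box can be parsed is a design choice of this sketch, not something stated in the abstract.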

Xuan Yu, Dayan Guan, Yanfeng Gu • 2025

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Visual Grounded Reasoning | TreeBench | Overall Score | 38 | 128 |
| High-resolution Visual Understanding | HR-Bench-8K | FSP | 92 | 73 |
| High-resolution perception | HR-Bench-4K | Overall Score | 77 | 44 |
| Visual Perception and Reasoning | V*Bench | Attribute Score | 86.09 | 41 |
| Visually Grounded Reasoning | V*Bench | Average Accuracy | 82.2 | 32 |
| Visually Grounded Reasoning | V* bench (test) | Overall Accuracy | 82.2 | 17 |
| Hallucination Evaluation | SPD-Faith Bench | CR | 39.7 | 7 |
| Faithfulness Evaluation | SPD-Faith Bench (test) | DS | 43.4 | 7 |
