Q-Zoom: Query-Aware Adaptive Perception for Efficient Multimodal Large Language Models

About

MLLMs require high-resolution visual inputs for fine-grained tasks like document understanding and dense scene perception. However, current global resolution scaling paradigms indiscriminately flood the quadratic self-attention mechanism with visually redundant tokens, severely bottlenecking inference throughput while ignoring spatial sparsity and query intent. To overcome this, we propose Q-Zoom, a query-aware adaptive high-resolution perception framework that operates in an efficient coarse-to-fine manner. First, a lightweight Dynamic Gating Network safely bypasses high-resolution processing when coarse global features suffice. Second, for queries demanding fine-grained perception, a Self-Distilled Region Proposal Network (SD-RPN) precisely localizes the task-relevant Region-of-Interest (RoI) directly from intermediate feature spaces. To optimize these modules efficiently, the gating network uses a consistency-aware generation strategy to derive deterministic routing labels, while the SD-RPN employs a fully self-supervised distillation paradigm. A continuous spatio-temporal alignment scheme and targeted fine-tuning then seamlessly fuse the dense local RoI with the coarse global layout. Extensive experiments demonstrate that Q-Zoom establishes a dominant Pareto frontier. Using Qwen2.5-VL-7B as a primary testbed, Q-Zoom accelerates inference by 2.52 times on Document & OCR benchmarks and 4.39 times in High-Resolution scenarios while matching the baseline's peak accuracy. Furthermore, when configured for maximum perceptual fidelity, Q-Zoom surpasses the baseline's peak performance by 1.1% and 8.1% on these respective benchmarks. These robust improvements transfer seamlessly to Qwen3-VL, LLaVA, and emerging RL-based thinking-with-image models. Project page is available at https://yuhengsss.github.io/Q-Zoom/.
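The coarse-to-fine routing described above can be illustrated with a toy sketch. Everything in it is an assumption made for illustration: the names `needs_high_res`, `roi_from_heatmap`, and `route`, the two thresholds, and the thresholded bounding box are simple stand-ins for the learned Dynamic Gating Network and SD-RPN, not the authors' implementation.

```python
from typing import List, Optional, Tuple

# Inclusive cell coordinates: (row_min, col_min, row_max, col_max).
Box = Tuple[int, int, int, int]


def needs_high_res(gate_score: float, threshold: float = 0.5) -> bool:
    """Hypothetical gate: a score above the threshold requests the fine path."""
    return gate_score > threshold


def roi_from_heatmap(heatmap: List[List[float]], keep: float = 0.5) -> Optional[Box]:
    """Bounding box of all cells whose relevance is at least keep * peak.

    Stand-in for a learned region proposal; returns None when the
    heatmap carries no positive relevance.
    """
    peak = max(max(row) for row in heatmap)
    if peak <= 0:
        return None
    cells = [
        (r, c)
        for r, row in enumerate(heatmap)
        for c, value in enumerate(row)
        if value >= keep * peak
    ]
    rows = [r for r, _ in cells]
    cols = [c for _, c in cells]
    return (min(rows), min(cols), max(rows), max(cols))


def route(
    gate_score: float,
    heatmap: List[List[float]],
    threshold: float = 0.5,
    keep: float = 0.5,
) -> Optional[Box]:
    """Return None for the coarse-only path, else the RoI to re-encode at high resolution."""
    if not needs_high_res(gate_score, threshold):
        return None
    return roi_from_heatmap(heatmap, keep)


# Toy relevance map over a 4x4 grid of coarse visual tokens (made-up values).
heatmap = [
    [0.0, 0.1, 0.0, 0.0],
    [0.0, 0.9, 0.8, 0.0],
    [0.0, 0.7, 1.0, 0.1],
    [0.0, 0.0, 0.1, 0.0],
]

print(route(0.2, heatmap))  # None: coarse global features are deemed sufficient
print(route(0.9, heatmap))  # (1, 1, 2, 2): only this region is processed at high resolution
```

In this toy setup the fine path re-encodes 4 of the 16 grid cells, which is the kind of token saving the abstract attributes to query-aware RoI selection.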

Yuheng Shi, Xiaohuan Pei, Linfeng Wen, Minjing Dong, Chang Xu • 2026

Related benchmarks

Task	Dataset	Metric	Result	
Text-based Visual Question Answering	TextVQA (val)	Accuracy	83.5	276
Chart Question Answering	ChartQA (test)	Accuracy	85.6	190
Document Visual Question Answering	DocVQA (val)	Accuracy	94.3	166
Visual Question Answering	InfoVQA (val)	Accuracy	79.4	91
High-resolution Visual Understanding	HR-Bench-8K	FSP	92.5	83
High-resolution Visual Understanding	HR-Bench-4K	FSP	93.8	49
Real-world Multimodal Understanding	MME-RealWorld-Lite	Lite Score	54.9	25
Vision-centric Reasoning	V* Bench (Overall)	Attribute Score	96.5	24
Optical Character Recognition Benchmarking	OCRBench (test)	Accuracy	85.4	21
