
Annotation-Free Visual Reasoning for High-Resolution Large Multimodal Models via Reinforcement Learning

About

Current Large Multimodal Models (LMMs) struggle with high-resolution visual inputs during the reasoning process, as the number of image tokens increases quadratically with resolution, introducing substantial redundancy and irrelevant information. A common practice is to identify key image regions and refer to their high-resolution counterparts during reasoning, typically trained with external visual supervision. However, such visual supervision cues require costly grounding labels from human annotators. Meanwhile, it remains an open question how to enhance a model's grounding abilities to support reasoning without relying on additional annotations. In this paper, we propose High-resolution Annotation-free Reasoning Technique (HART), a closed-loop framework that enables LMMs to focus on and self-verify key regions of high-resolution visual inputs. HART incorporates a post-training paradigm in which we design Advantage Preference Group Relative Policy Optimization (AP-GRPO) to encourage accurate localization of key regions without external visual annotations. Notably, HART provides explainable reasoning pathways and enables efficient optimization of localization. Extensive experiments on MME-RealWorld-Lite, TreeBench, V* Bench, HR-Bench-4K/8K, and MMStar demonstrate that HART improves performance across a wide range of high-resolution visual tasks, consistently outperforming strong baselines.
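The page does not spell out the AP-GRPO update, but the name indicates it builds on GRPO's group-relative advantage estimation, where each sampled rollout's reward is normalized against the statistics of its own group rather than a learned value baseline. A minimal sketch of that standard GRPO advantage computation follows; the function name is illustrative, and the preference-weighting that distinguishes AP-GRPO is not reproduced here.

```python
import statistics

def group_relative_advantages(rewards):
    """Standard GRPO-style advantages: normalize each rollout's
    scalar reward by the mean and population std of its group.
    Returns zeros when all rewards in the group are identical."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0:
        return [0.0 for _ in rewards]
    return [(r - mean) / std for r in rewards]
```

Because the baseline comes from the group itself, no critic network is needed; this is the property that lets reward signals (e.g., answer correctness) drive localization without external visual annotations.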

Jiacheng Yang, Anqi Chen, Yunkai Dang, Qi Fan, Cong Wang, Wenbin Li, Feng Miao, Yang Gao • 2026

Related benchmarks

Task                           Dataset                          Result                       Rank
Visual Grounded Reasoning      TreeBench                        Overall Score: 43.7          128
Visual Question Answering      HRBench 4K                       Accuracy: 0.711              54
Multimodal Question Answering  MME-RealWorld-Lite 1.0 (test)    Perception (AD) Acc: 57.7    19
Multimodal Question Answering  MMStar                           Accuracy: 62.8               13
Visual Grounding               TreeBench                        Error Rate: 24.6             6
Visual Grounding               Visual-CoT                       Error Rate: 22.3             6
Multimodal Question Answering  V*Bench                          Answer Accuracy: 80.6        4
Multimodal Question Answering  HR-Bench-8K                      Answer Accuracy: 71.9        4
