GeoEyes: On-Demand Visual Focusing for Evidence-Grounded Understanding of Ultra-High-Resolution Remote Sensing Imagery

About

The "thinking-with-images" paradigm enables multimodal large language models (MLLMs) to actively explore visual scenes via zoom-in tools. This is essential for ultra-high-resolution (UHR) remote sensing VQA, where task-relevant cues are sparse and tiny. However, we observe a consistent failure mode in existing zoom-enabled MLLMs: Tool Usage Homogenization, where tool calls collapse into task-agnostic patterns, limiting effective evidence acquisition. To address this, we propose GeoEyes, a staged training framework consisting of (1) a cold-start SFT dataset, UHR Chain-of-Zoom (UHR-CoZ), which covers diverse zooming regimes, and (2) an agentic reinforcement learning method, AdaZoom-GRPO, that explicitly rewards evidence gain and answer improvement during zoom interactions. The resulting model learns on-demand zooming with proper stopping behavior and achieves substantial improvements on UHR remote sensing benchmarks, with 54.23% accuracy on XLRS-Bench.
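As a rough illustration of the reward idea described above, the sketch below scores each zoom rollout by answer correctness plus evidence gained per zoom, then converts rewards within a rollout group into group-relative advantages, as GRPO-style methods do. It is not the authors' AdaZoom-GRPO implementation. The function names, the `gain_weight` and `zoom_cost` parameters, and the per-zoom cost term are all hypothetical; the abstract says only that evidence gain and answer improvement are rewarded.

```python
from statistics import mean, pstdev


def rollout_reward(answer_correct, evidence_gains, gain_weight=0.5, zoom_cost=0.05):
    """Hypothetical reward for one zoom rollout (assumed form, not from the paper).

    answer_correct: whether the final answer matched the reference.
    evidence_gains: one illustrative score per zoom call, measuring how much
                    task-relevant evidence that zoom added.
    """
    answer_term = 1.0 if answer_correct else 0.0
    gain_term = gain_weight * sum(evidence_gains)
    # Assumed per-call cost so that uninformative zooms are discouraged.
    cost_term = zoom_cost * len(evidence_gains)
    return answer_term + gain_term - cost_term


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of rollouts for the same question."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Three rollouts for the same question (illustrative numbers only).
rewards = [
    rollout_reward(True, []),               # correct without zooming
    rollout_reward(True, [0.4, 0.2]),       # correct, two informative zooms
    rollout_reward(False, [0.0, 0.0, 0.0]), # wrong, three uninformative zooms
]
advantages = group_relative_advantages(rewards)
```

Under these assumed weights, the rollout with informative zooms receives the highest advantage and the rollout with repeated uninformative zooms the lowest, which is the kind of signal that would discourage task-agnostic tool calls.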

Fengxiang Wang, Mingshuo Chen, Yueying Li, Yajie Yang, Yifan Zhang, Long Lan, Xue Yang, Hongda Sun, Yulin Wang, Di Wang, Jun Song, Jing Zhang, Bo Du • 2026

Related benchmarks

Task	Dataset	Result
Visual Question Answering	XLRS-Bench L-3 Capability (test)	OC 38.3	33
Remote Sensing Image Understanding	RSHR-Bench	Accuracy 39.75	20
Remote Sensing Image Understanding	XLRS-Bench	Accuracy 42.34	20
Visual Question Answering	LRS-VQA	Accuracy 27.53	20
Remote Sensing Visual Question Answering	XLRS-Bench	Average Score 0.542	17
Spatial Reasoning	UHR-Micro (test)	DrR 63.43	16
Counting	UHR-Micro (test)	GC 3.13	16
Vision-Language Reasoning	UHR-Micro (test)	Average Score 9.53	16
Fine-grained Understanding	UHR-Micro (test)	OC Score 1.72	16
Grounding	UHR-Micro (test)	GD Score 0.11	16
