GeoEyes: On-Demand Visual Focusing for Evidence-Grounded Understanding of Ultra-High-Resolution Remote Sensing Imagery

About

The "thinking-with-images" paradigm enables multimodal large language models (MLLMs) to actively explore visual scenes via zoom-in tools. This is essential for ultra-high-resolution (UHR) remote sensing VQA, where task-relevant cues are sparse and tiny. However, we observe a consistent failure mode in existing zoom-enabled MLLMs: Tool Usage Homogenization, where tool calls collapse into task-agnostic patterns, limiting effective evidence acquisition. To address this, we propose GeoEyes, a staged training framework consisting of (1) a cold-start SFT dataset, UHR Chain-of-Zoom (UHR-CoZ), which covers diverse zooming regimes, and (2) an agentic reinforcement learning method, AdaZoom-GRPO, that explicitly rewards evidence gain and answer improvement during zoom interactions. The resulting model learns on-demand zooming with proper stopping behavior and achieves substantial improvements on UHR remote sensing benchmarks, with 54.23% accuracy on XLRS-Bench.
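As a rough illustration of the reinforcement learning stage, the sketch below shows the group-relative advantage step that GRPO-style methods share: rewards for several rollouts of the same question are standardized within the group. The composite reward `zoom_reward`, its weights, and the `evidence_gain` values are hypothetical stand-ins for "evidence gain and answer improvement"; the abstract does not give the actual AdaZoom-GRPO reward, so this is not the authors' formula.

```python
from statistics import mean, pstdev


def zoom_reward(answer_correct, evidence_gain, w_answer=1.0, w_evidence=0.5):
    """Hypothetical composite reward (an assumption, not the paper's definition).

    answer_correct: whether the final answer matched the reference.
    evidence_gain:  assumed score in [0, 1] for how much the zoom steps helped.
    """
    return w_answer * float(answer_correct) + w_evidence * evidence_gain


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of rollouts for the same question.

    Uses the population standard deviation; implementations differ on this detail.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four made-up rollouts for one question: (answer_correct, evidence_gain).
rollouts = [(True, 0.8), (False, 0.1), (True, 0.2), (False, 0.0)]
rewards = [zoom_reward(correct, gain) for correct, gain in rollouts]
advantages = group_relative_advantages(rewards)
```

Under these assumed weights, a rollout that answers correctly with useful zooms receives a larger advantage than one that answers correctly with little evidence gain, and both rank above incorrect rollouts.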

Fengxiang Wang, Mingshuo Chen, Yueying Li, Yajie Yang, Yifan Zhang, Long Lan, Xue Yang, Hongda Sun, Yulin Wang, Di Wang, Jun Song, Jing Zhang, Bo Du • 2026

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Visual Question Answering | XLRS-Bench L-3 Capability (test) | OC: 38.3 | 33 |
| Remote Sensing Image Understanding | RSHR-Bench | Accuracy: 39.75 | 20 |
| Remote Sensing Image Understanding | XLRS-Bench | Accuracy: 42.34 | 20 |
| Visual Question Answering | LRS-VQA | Accuracy: 27.53 | 20 |
| Remote Sensing Visual Question Answering | XLRS-Bench | Average Score: 0.542 | 17 |
| Spatial Reasoning | UHR-Micro (test) | DrR: 63.43 | 16 |
| Counting | UHR-Micro (test) | GC: 3.13 | 16 |
| Vision-Language Reasoning | UHR-Micro (test) | Average Score: 9.53 | 16 |
| Fine-grained Understanding | UHR-Micro (test) | OC Score: 1.72 | 16 |
| Grounding | UHR-Micro (test) | GD Score: 0.11 | 16 |