GeoEyes: On-Demand Visual Focusing for Evidence-Grounded Understanding of Ultra-High-Resolution Remote Sensing Imagery
About
The "thinking-with-images" paradigm enables multimodal large language models (MLLMs) to actively explore visual scenes via zoom-in tools. This is essential for ultra-high-resolution (UHR) remote sensing VQA, where task-relevant cues are sparse and tiny. However, we observe a consistent failure mode in existing zoom-enabled MLLMs: Tool Usage Homogenization, where tool calls collapse into task-agnostic patterns, limiting effective evidence acquisition. To address this, we propose GeoEyes, a staged training framework consisting of (1) a cold-start SFT dataset, UHR Chain-of-Zoom (UHR-CoZ), which covers diverse zooming regimes, and (2) an agentic reinforcement learning method, AdaZoom-GRPO, that explicitly rewards evidence gain and answer improvement during zoom interactions. The resulting model learns on-demand zooming with proper stopping behavior and achieves substantial improvements on UHR remote sensing benchmarks, with 54.23% accuracy on XLRS-Bench.
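To make the idea of rewarding answer improvement during zoom interactions concrete, here is a minimal sketch. It is not the paper's AdaZoom-GRPO reward: the function `zoom_reward`, its bonus and cost constants, and the per-call penalty are hypothetical choices made for illustration. The second function shows the group-relative standardization of rewards commonly used in GRPO-style training, where each rollout's reward is compared with the other rollouts sampled for the same prompt.

```python
from statistics import mean, pstdev


def zoom_reward(correct_before, correct_after, num_zooms,
                improvement_bonus=0.5, zoom_cost=0.05):
    """Hypothetical shaped reward for one rollout (illustration only).

    correct_before: whether the answer was right before any zoom call.
    correct_after:  whether the final answer was right.
    num_zooms:      number of zoom tool calls made in the rollout.
    """
    base = 1.0 if correct_after else 0.0
    # Bonus only when zooming turned a wrong answer into a right one.
    improved = correct_after and not correct_before
    bonus = improvement_bonus if improved else 0.0
    # Small per-call cost so unnecessary zooms are discouraged (assumption).
    return base + bonus - zoom_cost * num_zooms


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one group of rollouts for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Four hypothetical rollouts for the same question.
    rewards = [
        zoom_reward(False, True, 2),   # zooming fixed the answer
        zoom_reward(True, True, 0),    # already correct, no zoom needed
        zoom_reward(True, True, 3),    # correct, but zoomed unnecessarily
        zoom_reward(False, False, 4),  # zoomed a lot, still wrong
    ]
    print(group_relative_advantages(rewards))
```

Under these assumed constants, a rollout that zooms only when it helps scores above one that zooms by habit, which is the kind of on-demand behavior with proper stopping that the abstract describes.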
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Visual Question Answering | XLRS-Bench L-3 Capability (test) | OC | 38.3 | 33 |
| Remote Sensing Image Understanding | RSHR-Bench | Accuracy | 39.75 | 20 |
| Remote Sensing Image Understanding | XLRS-Bench | Accuracy | 42.34 | 20 |
| Visual Question Answering | LRS-VQA | Accuracy | 27.53 | 20 |
| Remote Sensing Visual Question Answering | XLRS-Bench | Average Score | 0.542 | 17 |
| Spatial Reasoning | UHR-Micro (test) | DrR | 63.43 | 16 |
| Counting | UHR-Micro (test) | GC | 3.13 | 16 |
| Vision-Language Reasoning | UHR-Micro (test) | Average Score | 9.53 | 16 |
| Fine-grained Understanding | UHR-Micro (test) | OC Score | 1.72 | 16 |
| Grounding | UHR-Micro (test) | GD Score | 0.11 | 16 |