GeoSolver: Scaling Test-Time Reasoning in Remote Sensing with Fine-Grained Process Supervision
About
Vision-Language Models (VLMs) have significantly advanced remote sensing interpretation, but getting them to perform complex, step-by-step reasoning remains challenging. Recent efforts to bring Chain-of-Thought (CoT) reasoning to this domain have shown promise, yet ensuring that the intermediate steps stay faithful to the image remains a critical bottleneck. To address this, we introduce GeoSolver, a framework that moves remote sensing reasoning toward verifiable, process-supervised reinforcement learning. We first construct Geo-PRM-2M, a large-scale, token-level process supervision dataset synthesized via entropy-guided Monte Carlo Tree Search (MCTS) and targeted visual hallucination injection. Building on this dataset, we train GeoPRM, a token-level process reward model (PRM) that provides fine-grained faithfulness feedback. To make use of these verification signals, we propose Process-Aware Tree-GRPO, a reinforcement learning algorithm that combines tree-structured exploration with a faithfulness-weighted reward mechanism to assign credit to intermediate steps. Extensive experiments show that the resulting model, GeoSolver-9B, achieves state-of-the-art performance across diverse remote sensing benchmarks. Crucially, GeoPRM enables robust Test-Time Scaling (TTS). Used as a general geospatial verifier, it scales the performance of GeoSolver-9B and directly improves general-purpose VLMs, indicating that it generalizes across models.
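The abstract describes using a token-level PRM as a verifier at test time. One common way to do this is best-of-N reranking: sample several candidate reasoning chains, score each one, and keep the highest-scoring chain. The sketch below illustrates that general idea only. The function names, the `score_tokens` callable, the mean/min aggregation, and the toy scorer are all hypothetical and are not taken from the GeoSolver paper or its code.

```python
from statistics import mean
from typing import Callable, List, Sequence


def aggregate_token_scores(token_scores: Sequence[float], how: str = "mean") -> float:
    """Collapse per-token faithfulness scores into a single candidate score.

    The aggregation rule is an assumption: "mean" rewards overall
    faithfulness, "min" penalizes a single badly hallucinated token.
    """
    if not token_scores:
        raise ValueError("token_scores must not be empty")
    if how == "mean":
        return mean(token_scores)
    if how == "min":
        return min(token_scores)
    raise ValueError(f"unknown aggregation: {how}")


def best_of_n(
    candidates: Sequence[str],
    score_tokens: Callable[[str], List[float]],
    how: str = "mean",
) -> str:
    """Return the candidate whose aggregated verifier score is highest."""
    if not candidates:
        raise ValueError("candidates must not be empty")
    return max(candidates, key=lambda c: aggregate_token_scores(score_tokens(c), how))


def toy_scorer(text: str) -> List[float]:
    """Hypothetical stand-in for a PRM: low score on a flagged token."""
    return [0.1 if tok == "<hallucinated>" else 0.9 for tok in text.split()]


if __name__ == "__main__":
    chains = [
        "the image shows <hallucinated> ships",
        "the image shows three ships",
    ]
    print(best_of_n(chains, toy_scorer))
```

In a real setup, `toy_scorer` would be replaced by a call to the trained reward model, and the choice between mean and min aggregation would need to be validated empirically.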
Related benchmarks
| Task | Dataset | Metric | Score | Rank |
|---|---|---|---|---|
| Scene Classification | AID | Top-1 Acc | 98.33 | 69 |
| Image Captioning | RSICD | BLEU-4 | 36.18 | 37 |
| Visual Grounding | DIOR-RSVG | -- | -- | 34 |
| Image Captioning | NWPU-Captions | BLEU-4 | 80.93 | 30 |
| Visual Question Answering | RSVQA-HR | -- | -- | 29 |
| Remote Sensing Classification | SIRI-WHU | Top-1 Acc | 76 | 28 |
| Scene Classification | WHU-RS19 | Accuracy | 99.5 | 22 |
| Image Captioning | RSITMD | BLEU-4 | 52.5 | 21 |
| Object Counting | DOTA v2 (val) | Accuracy | 45.92 | 19 |
| Object Counting | HRRSD | Accuracy | 84.13 | 17 |