
Towards Faithful Reasoning in Remote Sensing: A Perceptually-Grounded GeoSpatial Chain-of-Thought for Vision-Language Models

About

Vision-Language Models (VLMs) in remote sensing often fail at complex analytical tasks, a limitation stemming from their end-to-end training paradigm that bypasses crucial reasoning steps and leads to unverifiable outputs. To address this limitation, we introduce the Perceptually-Grounded Geospatial Chain-of-Thought (Geo-CoT), a framework that models remote sensing analysis as a verifiable, multi-step process. We instill this analytical process through a two-stage alignment strategy, leveraging Geo-CoT380k, the first large-scale dataset of structured Geo-CoT rationales. This strategy first employs supervised fine-tuning (SFT) to instill the foundational cognitive architecture, then leverages Group Relative Policy Optimization (GRPO) to refine the model's reasoning policy towards factual correctness. The resulting model, RSThinker, outputs both a final answer and its justifying, verifiable analytical trace. This capability yields dominant performance, significantly outperforming state-of-the-art models across a comprehensive range of tasks. The public release of our Geo-CoT380k dataset and RSThinker model upon publication serves as a concrete pathway from opaque perception towards structured, verifiable reasoning for Earth Observation.
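As a rough illustration of the second alignment stage, GRPO-style training is commonly described as scoring each sampled response relative to the other responses drawn for the same prompt. The sketch below shows only that normalization step. The function name, the reward values, and the `eps` constant are assumptions made for illustration; the abstract does not specify the reward design or training details used for RSThinker.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the mean and std of its own group.

    `rewards` holds one scalar reward per sampled response to a single
    prompt. This is the commonly described group-normalized form, not a
    confirmed reproduction of the authors' setup.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four sampled answers to one prompt
# (e.g. 1.0 if the final answer is factually correct, else 0.0).
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Responses that beat their group's average receive a positive advantage and the rest a negative one, so no separate learned value model is needed to provide the baseline.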

Jiaqi Liu, Lang Sun, Ronghao Fu, Bo Yang • 2025

Related benchmarks

| Task | Dataset | Result | Rank |
| Object Counting | HRRSD | Accuracy: 85.5 | 25 |
| Object Counting | RSOD | Accuracy: 37.3 | 19 |
| Visual Question Answering | GeoBench-VLM | Object Localization & Counting Score: 24.7 | 8 |
| Visual Question Answering | EarthVQA | Basic Judging Score: 69.4 | 8 |
