RemoteReasoner: Towards Unifying Geospatial Reasoning Workflow

About

Remote sensing imagery presents vast, inherently unstructured spatial data, and interpreting it requires sophisticated reasoning about complex user intents and contextual relationships that goes beyond simple recognition tasks. In this paper, we aim to construct an Earth observation workflow that handles complex queries by reasoning about spatial context and user intent. As a reasoning workflow, it should autonomously explore and construct its own inference paths rather than being confined to predefined ground-truth sequences. Ideally, its architecture should be unified yet general, able to perform diverse reasoning tasks with one model and without additional fine-tuning. Existing remote sensing approaches rely on supervised fine-tuning paradigms and task-specific heads, which limits both autonomous reasoning and unified generalization. To this end, we propose RemoteReasoner, a unified workflow for geospatial reasoning. RemoteReasoner integrates a multi-modal large language model (MLLM) that interprets user instructions and localizes targets with task transformation strategies that support tasks at multiple granularities: object-, region-, and pixel-level. In contrast to existing methods, our framework is trained with reinforcement learning (RL) to endow the MLLM with sufficient reasoning autonomy. At the inference stage, our transformation strategies enable diverse task output formats without requiring task-specific decoders or further fine-tuning. Experiments demonstrate that RemoteReasoner achieves state-of-the-art (SOTA) performance across multi-granularity reasoning tasks. Furthermore, it retains the MLLM's inherent generalization capability, demonstrating robust performance on unseen tasks and out-of-distribution categories.

Liang Yao, Fan Liu, Hongbo Lu, Chuanyi Zhang, Rui Min, Shengxiang Xu, Shimin Di, Pai Peng • 2025

Related benchmarks

Task	Dataset	Metric	Result	
Reasoning Segmentation	EarthReason (val)	gIoU	69.29	47
Reasoning Segmentation	EarthReason (test)	gIoU	71	28
Geospatial region reasoning	EarthReason (test)	Accuracy@0.5	68.11	13
Remote Sensing Image Captioning	Sydney (test)	ReconScore	81.57	13
Remote Sensing Image Captioning	RSIEval (test)	ReconScore	79.23	13
Remote Sensing Image Captioning	UCM (test)	ReconScore	79.55	13
Socio-class Segmentation	SocioSeg (test)	cIoU	42.9	10
Socio-function Segmentation	SocioSeg (test)	cIoU	38	10
Socio-name Segmentation	SocioSeg (test)	cIoU	46.6	10
Socio-semantic Segmentation	SocioSeg (test)	cIoU	43.2	10

Showing 10 of 15 rows
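
The page does not define the metrics reported in the table. As a rough guide, the sketch below follows the convention common in reasoning-segmentation work: gIoU is the mean of per-image IoUs (not "Generalized IoU"), cIoU is the cumulative intersection over the cumulative union across the dataset, and Accuracy@0.5 is the share of predicted boxes whose IoU with the ground-truth box is at least 0.5. The function names, the pure-Python mask representation, and the handling of empty masks are assumptions made for illustration; the paper's exact evaluation protocol may differ.

```python
from typing import Sequence, Tuple

Mask = Sequence[Sequence[int]]            # binary mask as rows of 0/1
Box = Tuple[float, float, float, float]   # (x1, y1, x2, y2)


def mask_inter_union(pred: Mask, gt: Mask) -> Tuple[int, int]:
    """Count intersection and union pixels of two same-sized binary masks."""
    inter = union = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            inter += 1 if (p and g) else 0
            union += 1 if (p or g) else 0
    return inter, union


def giou_and_ciou(preds: Sequence[Mask], gts: Sequence[Mask]) -> Tuple[float, float]:
    """Return (mean per-image IoU, cumulative intersection / cumulative union)."""
    per_image = []
    total_inter = total_union = 0
    for pred, gt in zip(preds, gts):
        inter, union = mask_inter_union(pred, gt)
        # Assumption: an empty prediction on an empty target scores IoU 1.0.
        per_image.append(inter / union if union else 1.0)
        total_inter += inter
        total_union += union
    giou = sum(per_image) / len(per_image)
    ciou = total_inter / total_union if total_union else 1.0
    return giou, ciou


def box_iou(a: Box, b: Box) -> float:
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def accuracy_at(pred_boxes: Sequence[Box], gt_boxes: Sequence[Box],
                threshold: float = 0.5) -> float:
    """Fraction of predictions whose box IoU with the target meets the threshold."""
    hits = sum(box_iou(p, g) >= threshold for p, g in zip(pred_boxes, gt_boxes))
    return hits / len(gt_boxes)
```

Because cIoU pools pixels over the whole dataset, large targets dominate it, whereas gIoU weights every image equally; this is one reason the two numbers can diverge on the same predictions.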
