
TerraScope: Pixel-Grounded Visual Reasoning for Earth Observation

About

Vision-language models (VLMs) have shown promise in Earth observation (EO), yet they struggle with tasks that require grounding complex spatial reasoning in precise pixel-level visual representations. To address this problem, we introduce TerraScope, a unified VLM that delivers pixel-grounded geospatial reasoning with two key capabilities: (1) modality-flexible reasoning: it handles single-modality inputs (optical or SAR) and adaptively fuses different modalities into the reasoning process when both are available; (2) multi-temporal reasoning: it integrates temporal sequences for change analysis across multiple time points. In addition, we curate Terra-CoT, a large-scale dataset containing 1 million samples with pixel-level masks embedded in reasoning chains across multiple sources. We also propose TerraScope-Bench, the first benchmark for pixel-grounded geospatial reasoning, comprising six sub-tasks that evaluate both answer accuracy and mask quality to ensure authentic pixel-grounded reasoning. Experiments show that TerraScope significantly outperforms existing VLMs on pixel-grounded geospatial reasoning while providing interpretable visual evidence.
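The modality-flexible fusion described above can be illustrated with a minimal sketch. This is not the paper's implementation: the function name `fuse_modalities` and the scalar gate are hypothetical stand-ins for whatever learned fusion module TerraScope actually uses; the sketch only shows the single-modality pass-through versus adaptive-fusion control flow.

```python
import numpy as np

def fuse_modalities(optical=None, sar=None):
    """Illustrative modality-flexible fusion (not TerraScope's method):
    accepts optical and/or SAR feature tokens and returns one token
    sequence for the downstream language model."""
    present = [m for m in (optical, sar) if m is not None]
    if not present:
        raise ValueError("at least one modality is required")
    if len(present) == 1:
        # Single-modality path: pass the available tokens through unchanged.
        return present[0]
    # Both modalities available: a sigmoid-gated blend stands in for a
    # learned cross-attention / fusion module (hypothetical).
    gate = 1.0 / (1.0 + np.exp(-(optical.mean() - sar.mean())))
    return gate * optical + (1.0 - gate) * sar

opt = np.ones((4, 8))    # 4 tokens, 8-dim: stand-in optical features
sar = np.zeros((4, 8))   # stand-in SAR features
fused = fuse_modalities(opt, sar)
print(fused.shape)  # (4, 8)
```

In a real model the gate would be produced by a learned network over both token streams rather than a mean difference; the point here is only that the same interface serves one or two input modalities.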

Yan Shu, Bin Ren, Zhitong Xiong, Xiao Xiang Zhu, Begüm Demir, Nicu Sebe, Paolo Rota • 2026

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Visual Question Answering | RSVQA LR | Aggregated Score: 91.4 | 14 |
| Visual Question Answering | Landsat30 AU | APR: 69.84 | 7 |
| Geospatial Reasoning | TerraBench (TerraScope-Bench) | Accuracy: 68.9 | 5 |
| Optical-SAR Damage Assessment | DisasterM3 (test) | BDC: 50.4 | 5 |
| Geospatial Reasoning | Landsat (TerraScope-Bench) | Accuracy: 73.9 | 3 |
| Scene Classification | BigEarthNet | Accuracy: 69.2 | 3 |

Other info

GitHub
