
TerraScope: Pixel-Grounded Visual Reasoning for Earth Observation

About

Vision-language models (VLMs) have shown promise in Earth observation (EO), yet they struggle with tasks that require grounding complex spatial reasoning in precise pixel-level visual representations. To address this problem, we introduce TerraScope, a unified VLM that delivers pixel-grounded geospatial reasoning with two key capabilities: (1) modality-flexible reasoning: it handles single-modality inputs (optical or SAR) and adaptively fuses different modalities into the reasoning process when both are available; (2) multi-temporal reasoning: it integrates temporal sequences for change analysis across multiple time points. In addition, we curate Terra-CoT, a large-scale dataset containing 1 million samples with pixel-level masks embedded in reasoning chains across multiple sources. We also propose TerraScope-Bench, the first benchmark for pixel-grounded geospatial reasoning, comprising six sub-tasks that evaluate both answer accuracy and mask quality to ensure authentic pixel-grounded reasoning. Experiments show that TerraScope significantly outperforms existing VLMs on pixel-grounded geospatial reasoning while providing interpretable visual evidence.
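The modality-flexible fusion described above can be illustrated with a minimal sketch. This is not the paper's implementation: the function name `fuse_modalities` and the scalar gate are hypothetical stand-ins for whatever learned fusion module TerraScope actually uses; the sketch only shows the single-modality pass-through versus adaptive-fusion control flow.

```python
import numpy as np

def fuse_modalities(optical=None, sar=None):
    """Illustrative modality-flexible fusion (not TerraScope's method):
    accepts optical and/or SAR feature tokens and returns one token
    sequence for the downstream language model."""
    present = [m for m in (optical, sar) if m is not None]
    if not present:
        raise ValueError("at least one modality is required")
    if len(present) == 1:
        # Single-modality path: pass the available tokens through unchanged.
        return present[0]
    # Both modalities available: a sigmoid-gated blend stands in for a
    # learned cross-attention / fusion module (hypothetical).
    gate = 1.0 / (1.0 + np.exp(-(optical.mean() - sar.mean())))
    return gate * optical + (1.0 - gate) * sar

opt = np.ones((4, 8))    # 4 tokens, 8-dim: stand-in optical features
sar = np.zeros((4, 8))   # stand-in SAR features
fused = fuse_modalities(opt, sar)
print(fused.shape)  # (4, 8)
```

In a real model the gate would be produced by a learned network over both token streams rather than a mean difference; the point here is only that the same interface serves one or two input modalities.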

Yan Shu, Bin Ren, Zhitong Xiong, Xiao Xiang Zhu, Begüm Demir, Nicu Sebe, Paolo Rota • 2026

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Visual Question Answering | RSVQA LR | Aggregated Score: 91.4 | 14 |
| Visual Question Answering | Landsat30 AU | APR: 69.84 | 7 |
| Geospatial Reasoning | TerraBench (TerraScope-Bench) | Accuracy: 68.9 | 5 |
| Optical-SAR Damage Assessment | DisasterM3 (test) | BDC: 50.4 | 5 |
| Geospatial Reasoning | Landsat (TerraScope-Bench) | Accuracy: 73.9 | 3 |
| Scene Classification | BigEarthNet | Accuracy: 69.2 | 3 |

Other info

GitHub
