SegEarth-R2: Towards Comprehensive Language-guided Segmentation for Remote Sensing Images

About

Effectively grounding complex language to pixels in remote sensing (RS) images is a critical challenge for applications like disaster response and environmental monitoring. Current models can parse simple, single-target commands but fail when presented with complex geospatial scenarios, e.g., segmenting objects at various granularities, executing multi-target instructions, and interpreting implicit user intent. To drive progress against these failures, we present LaSeRS, the first large-scale dataset built for comprehensive training and evaluation across four critical dimensions of language-guided segmentation: hierarchical granularity, target multiplicity, reasoning requirements, and linguistic variability. By capturing these dimensions, LaSeRS moves beyond simple commands and provides a benchmark for complex geospatial reasoning. This addresses a critical gap: existing datasets oversimplify the task, which leaves models trained on them brittle in real-world use. We also propose SegEarth-R2, an MLLM architecture designed for comprehensive language-guided segmentation in RS, which directly confronts these challenges. The model's effectiveness stems from two key improvements: (1) a spatial attention supervision mechanism that specifically handles the localization of small objects and their components, and (2) a flexible and efficient segmentation query mechanism that handles both single-target and multi-target scenarios. Experimental results demonstrate that SegEarth-R2 achieves outstanding performance on LaSeRS and other benchmarks, establishing a powerful baseline for the next generation of geospatial segmentation. All data and code will be released at https://github.com/earth-insights/SegEarth-R2.

Zepeng Xin, Kaiyu Li, Luodi Chen, Wanchen Li, Yuchen Xiao, Hui Qiao, Weizhan Zhang, Deyu Meng, Xiangyong Cao • 2025

Related benchmarks

Task	Dataset	Result
Referring Remote Sensing Image Segmentation	RRSIS-D (test)	--	57
Reasoning Segmentation	EarthReason (val)	gIoU 72.3	47
Referring Segmentation	RISBench (test)	gIoU 70.5	31
Reasoning Segmentation	EarthReason (test)	gIoU 73.5	28
Referring Remote Sensing Image Segmentation	RRSIS-D (val)	--	28
Referring Expression Segmentation	RRSIS-D	mIoU 67.9	27
Referring Expression Segmentation	LaSeRS (test)	gIoU (Semantic) 60.2	8
Referring Segmentation	RefSegRS (val)	gIoU 84.4	6
Referring Segmentation	RefSegRS (test)	gIoU 74.8	6
Referring Segmentation	RISBench (val)	gIoU 69.8	5
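
Most results above are reported as gIoU, which this page does not define. The sketch below assumes the convention common in reasoning-segmentation work: gIoU is the mean of per-image IoU scores, while the related cIoU divides the cumulative intersection by the cumulative union over the whole set. The mask representation (nested lists of 0/1), the function names, and the rule that an empty prediction on an empty target scores 1.0 are illustrative assumptions, not taken from the SegEarth-R2 code. The functions return fractions, and the table values appear to be the same quantity expressed as percentages.

```python
from __future__ import annotations

from typing import Sequence

# Hypothetical representation: a binary mask as rows of 0/1 values.
Mask = Sequence[Sequence[int]]


def intersection_and_union(pred: Mask, target: Mask) -> tuple[int, int]:
    """Count intersection and union pixels of two same-sized binary masks."""
    inter = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if p and t:
                inter += 1
            if p or t:
                union += 1
    return inter, union


def giou(preds: Sequence[Mask], targets: Sequence[Mask]) -> float:
    """Mean of per-image IoU (assumed definition of gIoU)."""
    scores = []
    for pred, target in zip(preds, targets):
        inter, union = intersection_and_union(pred, target)
        # Assumed convention: empty prediction on an empty target is a perfect match.
        scores.append(1.0 if union == 0 else inter / union)
    return sum(scores) / len(scores)


def ciou(preds: Sequence[Mask], targets: Sequence[Mask]) -> float:
    """Cumulative intersection over cumulative union across all images."""
    total_inter = 0
    total_union = 0
    for pred, target in zip(preds, targets):
        inter, union = intersection_and_union(pred, target)
        total_inter += inter
        total_union += union
    return 1.0 if total_union == 0 else total_inter / total_union


if __name__ == "__main__":
    # Two toy 2x2 images: the first prediction over-segments, the second is exact.
    preds = [[[1, 1], [0, 0]], [[1, 0], [0, 0]]]
    targets = [[[1, 0], [0, 0]], [[1, 0], [0, 0]]]
    print(f"gIoU = {giou(preds, targets):.3f}")  # per-image IoUs 0.5 and 1.0
    print(f"cIoU = {ciou(preds, targets):.3f}")  # 2 intersecting pixels over 3 union pixels
```

Because gIoU weights every image equally, a miss on a small object costs as much as a miss on a large one. cIoU is instead dominated by images with large targets.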
