Bridging Semantics and Geometry: A Decoupled LVLM-SAM Framework for Reasoning Segmentation in Optical Remote Sensing

About

Large Vision-Language Models (LVLMs) hold great promise for advancing optical remote sensing (RS) analysis, yet existing reasoning segmentation frameworks couple linguistic reasoning and pixel prediction through end-to-end supervised fine-tuning, leading to weak geometric grounding and limited generalization across tasks. To address this, we developed Think2Seg-RS, a decoupled framework that trains an LVLM prompter to control a frozen Segment Anything Model (SAM) via structured geometric prompts. Through a mask-only Group Relative Policy Optimization (GRPO) reinforcement learning objective driven strictly by final mask IoU, the LVLM learns to translate abstract semantic reasoning into spatially grounded actions, achieving state-of-the-art performance on the EarthReason dataset. Notably, Think2Seg-RS outperforms leading approaches such as RemoteReasoner and SegEarth-R1 on this dataset, reaching a test cIoU of 75.60% and gIoU of 73.36%, absolute improvements of 6.47 and 2.40 percentage points over the strongest baseline, respectively. Zero-shot evaluations across three referring segmentation benchmarks reveal a fundamental difference in task inductive bias: semantic-level grounding aggregates all regions matching a conceptual intent, whereas instance-level tasks demand discrete object separation. We further found that compact segmenters outperform larger ones under semantic-level supervision by mitigating textural over-segmentation, and that unconstrained negative prompting is unstable in heterogeneous aerial backgrounds. Together, these findings demonstrate that optimizing LVLMs through direct segmentation feedback offers a scalable framework for complex geospatial reasoning, bridging the gap between abstract language understanding and precise pixel-level execution.
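The mask-only reward described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes the reward is the plain IoU between the mask produced from the LVLM's prompts and the ground-truth mask, and that advantages are obtained by standardizing rewards within a group of sampled responses, as is common for GRPO. The function names, the toy masks, and the choice of population standard deviation are hypothetical.

```python
from statistics import mean, pstdev


def mask_iou(pred, target):
    """IoU between two binary masks given as equal-shaped nested lists of 0/1."""
    inter = union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            inter += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    # Assumption: two empty masks are treated as a perfect match.
    return inter / union if union else 1.0


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one group of sampled responses."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; eps avoids division by zero
    return [(r - mu) / (sigma + eps) for r in rewards]


# Toy ground-truth mask and three hypothetical masks, standing in for what a
# frozen segmenter might return for three sampled prompt sets.
target = [[1, 1, 0],
          [1, 1, 0]]
candidates = [
    [[1, 1, 0], [1, 1, 0]],  # matches the target exactly
    [[1, 0, 0], [1, 0, 0]],  # covers half of the target
    [[0, 0, 1], [0, 0, 1]],  # misses the target entirely
]

rewards = [mask_iou(mask, target) for mask in candidates]
advantages = group_relative_advantages(rewards)
```

In this sketch, responses whose masks overlap the target more than the group average receive positive advantages and the rest receive negative ones, so only the final mask quality steers the prompter.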

Xu Zhang, Junyao Ge, Yang Zheng, Kaitai Guo, Jimin Liang • 2025

Related benchmarks

Task	Dataset	Result
Reasoning Segmentation	EarthReason (val)	gIoU 72.46	47
Referring Segmentation	RISBench (test)	gIoU 47.72	31
Reasoning Segmentation	EarthReason (test)	gIoU 73.73	28
Referring Expression Segmentation	RRSIS-D (test)	gIoU 47.97	8
Referring Expression Segmentation	RefSegRS (test)	gIoU 15.53	7
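This page does not define gIoU and cIoU. The sketch below assumes the definitions commonly used in the reasoning segmentation literature: gIoU is the mean of per-image IoUs, and cIoU is the cumulative intersection divided by the cumulative union over the whole set. The function names and the toy masks are hypothetical.

```python
def intersection_and_union(pred, target):
    """Pixel counts of intersection and union for two binary masks (nested lists)."""
    inter = union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            inter += int(bool(p) and bool(t))
            union += int(bool(p) or bool(t))
    return inter, union


def giou_and_ciou(pairs):
    """Return (gIoU, cIoU) for an iterable of (prediction, ground truth) mask pairs."""
    per_image = []
    total_inter = total_union = 0
    for pred, target in pairs:
        inter, union = intersection_and_union(pred, target)
        # Assumption: an image where both masks are empty scores 1.0.
        per_image.append(inter / union if union else 1.0)
        total_inter += inter
        total_union += union
    giou = sum(per_image) / len(per_image)
    ciou = total_inter / total_union if total_union else 1.0
    return giou, ciou


# Two toy images: a small object segmented perfectly and a large object
# segmented only halfway.
pairs = [
    ([[1, 0], [0, 0]], [[1, 0], [0, 0]]),
    ([[1, 1], [0, 0]], [[1, 1], [1, 1]]),
]
giou, ciou = giou_and_ciou(pairs)
```

Under these assumed definitions, gIoU weights every image equally while cIoU is dominated by large objects, which is why the two numbers can differ on the same predictions.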

