GeoPix: Multi-Modal Large Language Model for Pixel-level Image Understanding in Remote Sensing

About

Multi-modal large language models (MLLMs) have achieved remarkable success in image- and region-level remote sensing (RS) image understanding tasks, such as image captioning, visual question answering, and visual grounding. However, existing RS MLLMs lack pixel-level dialogue capability, which involves responding to user instructions with segmentation masks for specific instances. In this paper, we propose GeoPix, an RS MLLM that extends image understanding capabilities to the pixel level. This is achieved by equipping the MLLM with a mask predictor, which transforms visual features from the vision encoder into masks conditioned on the LLM's segmentation token embeddings. To facilitate the segmentation of multi-scale objects in RS imagery, a class-wise learnable memory module is integrated into the mask predictor to capture and store class-wise geo-context at the instance level across the entire dataset. In addition, to address the absence of large-scale datasets for training pixel-level RS MLLMs, we construct the GeoPixInstruct dataset, comprising 65,463 images and 140,412 instances, with each instance annotated with text descriptions, bounding boxes, and masks. Furthermore, we develop a two-stage training strategy to balance the distinct requirements of text generation and mask prediction in multi-modal multi-task optimization. Extensive experiments verify the effectiveness and superiority of GeoPix in pixel-level segmentation tasks, and show that it maintains competitive performance on image- and region-level benchmarks.
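As a rough illustration of the mask-prediction idea described above, the sketch below scores each pixel's feature vector against a segmentation token embedding and thresholds the result. It is not the GeoPix implementation: the function name `predict_mask`, the dot-product scoring, and the way the class memory vector is simply added to every pixel feature are all assumptions made for illustration, and the tiny 2x2 feature grid is made-up data.

```python
import math

def predict_mask(features, seg_embedding, memory=None, threshold=0.5):
    """Toy mask predictor (hypothetical, not the paper's architecture).

    features:      H x W grid of C-dimensional pixel feature vectors
    seg_embedding: C-dimensional embedding of the LLM's segmentation token
    memory:        optional C-dimensional class memory vector; adding it to
                   each pixel feature is an assumption for illustration only
    """
    mask = []
    for row in features:
        mask_row = []
        for pixel in row:
            if memory is not None:
                pixel = [p + m for p, m in zip(pixel, memory)]
            # Condition on the segmentation token via a dot product.
            logit = sum(p * s for p, s in zip(pixel, seg_embedding))
            prob = 1.0 / (1.0 + math.exp(-logit))
            mask_row.append(1 if prob > threshold else 0)
        mask.append(mask_row)
    return mask

# Made-up 2x2 feature map with 2 channels per pixel.
features = [
    [[2.0, 0.0], [0.0, 2.0]],
    [[1.5, 0.5], [-1.0, 0.0]],
]
seg_embedding = [1.0, -1.0]

mask = predict_mask(features, seg_embedding)
print(mask)  # [[1, 0], [1, 0]]

mask_with_memory = predict_mask(features, seg_embedding, memory=[1.5, 0.0])
print(mask_with_memory)  # [[1, 0], [1, 1]]
```

In this toy setup the same features yield a different mask once the memory vector is added, which hints at how stored class-level context could shift per-pixel decisions. The actual model may combine these signals quite differently.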

Ruizhe Ou, Yuan Hu, Fan Zhang, Jiaxin Chen, Yu Liu • 2025

Related benchmarks

Task	Dataset	Result
Referring Remote Sensing Image Segmentation	RRSIS-D (test)	Mean IoU (mIoU): 67.99	57
Video Question Answering	Traffic-VQA (test)	Overall Accuracy (OA): 43.62	38
Visual Question Answering	RSVQA-HR	Average Score: 32.2	38
Reasoning Segmentation	EarthReason (test)	--	28
Remote Sensing Scene Classification	EuroSAT	--	15
Visual Question Answering	RSVQA-LR	Aggregated Score: 17.7	14
Remote Sensing Image Captioning	Sydney (test)	ReconScore: 79.6	13
Remote Sensing Image Captioning	RSIEval (test)	ReconScore: 77.34	13
Remote Sensing Image Captioning	UCM (test)	ReconScore: 77.19	13
Reasoning Segmentation	DRSeg	Attribute Reasoning gIoU: 42.96	12

Showing 10 of 29 rows
