
GeoPix: Multi-Modal Large Language Model for Pixel-level Image Understanding in Remote Sensing

About

Multi-modal large language models (MLLMs) have achieved remarkable success in image- and region-level remote sensing (RS) image understanding tasks, such as image captioning, visual question answering, and visual grounding. However, existing RS MLLMs lack pixel-level dialogue capability, that is, the ability to respond to user instructions with segmentation masks for specific instances. In this paper, we propose GeoPix, an RS MLLM that extends image understanding capabilities to the pixel level. This is achieved by equipping the MLLM with a mask predictor, which transforms visual features from the vision encoder into masks conditioned on the LLM's segmentation token embeddings. To facilitate the segmentation of multi-scale objects in RS imagery, a class-wise learnable memory module is integrated into the mask predictor to capture and store class-wise geo-context at the instance level across the entire dataset. In addition, to address the absence of large-scale datasets for training pixel-level RS MLLMs, we construct the GeoPixInstruct dataset, comprising 65,463 images and 140,412 instances, with each instance annotated with text descriptions, bounding boxes, and masks. Furthermore, we develop a two-stage training strategy to balance the distinct requirements of text generation and mask prediction in multi-modal multi-task optimization. Extensive experiments verify the effectiveness and superiority of GeoPix in pixel-level segmentation tasks, while also maintaining competitive performance in image- and region-level benchmarks.
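To make the idea of a mask "conditioned on the LLM's segmentation token embeddings" more concrete, here is a minimal sketch. It is not the GeoPix mask predictor: it assumes, purely for illustration, that each pixel has a feature vector of the same dimension as the token embedding and that the mask logit is their dot product. The names `predict_mask`, `pixel_features`, and `seg_token_embedding` are hypothetical, and the class-wise memory module is not modelled.

```python
import math


def predict_mask(pixel_features, seg_token_embedding, threshold=0.5):
    """Toy token-conditioned mask prediction (illustrative assumption only).

    pixel_features: H x W grid of length-D feature vectors (nested lists).
    seg_token_embedding: length-D vector standing in for the LLM's
        segmentation token embedding.
    Returns an H x W binary mask.
    """
    mask = []
    for row in pixel_features:
        mask_row = []
        for feat in row:
            # Similarity between the pixel feature and the token embedding.
            logit = sum(f * e for f, e in zip(feat, seg_token_embedding))
            # Sigmoid turns the logit into a foreground probability.
            prob = 1.0 / (1.0 + math.exp(-logit))
            mask_row.append(1 if prob > threshold else 0)
        mask.append(mask_row)
    return mask


# Made-up 2 x 2 feature map with 2-dimensional features.
features = [
    [[2.0, 0.0], [0.0, 2.0]],
    [[1.5, 0.5], [-1.0, 1.0]],
]
seg_token = [1.0, -1.0]
mask = predict_mask(features, seg_token)
print(mask)
```

Changing `seg_token` changes which pixels are selected while the visual features stay fixed, which is the sense in which the predicted mask depends on the instruction.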

Ruizhe Ou, Yuan Hu, Fan Zhang, Jiaxin Chen, Yu Liu • 2025

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Video Question Answering | Traffic-VQA (test) | Overall Accuracy (OA) | 43.62 | 38 |
| Visual Question Answering | RSVQA-HR | Average Score | 32.2 | 29 |
| Remote Sensing Scene Classification | EuroSAT | -- | -- | 15 |
| Visual Question Answering | RSVQA LR | Aggregated Score | 17.7 | 14 |
| Pixel-level Visual Grounding | DVGBench | mIoU | 10.61 | 11 |
| Image Captioning | NWPU-Captions | GEval | 0.072 | 10 |
| Image Captioning | UCM Captions | GEval Score | 14.5 | 10 |
| Remote Sensing Scene Classification | AID | F1 Score | 5.4 | 10 |
| Remote Sensing Scene Classification | SkyScript bench | F1 Score | 0.6 | 10 |
| Remote Sensing Scene Classification | Million-AID | F1 Score | 0.00e+0 | 10 |
Showing 10 of 17 rows
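
For readers unfamiliar with the mIoU metric listed for DVGBench, the sketch below shows one common way to compute it: intersection over union per predicted/ground-truth mask pair, averaged over pairs. This is an assumption about the general metric, not the benchmark's exact protocol; benchmarks differ in how they average and in whether they report a fraction or a percentage. The function names `iou` and `mean_iou` are hypothetical.

```python
def iou(pred, target):
    """Intersection over union of two same-sized binary masks (nested lists)."""
    inter = 0
    union = 0
    for p_row, t_row in zip(pred, target):
        for p, t in zip(p_row, t_row):
            inter += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    # Convention chosen here: two empty masks count as a perfect match.
    return inter / union if union else 1.0


def mean_iou(pairs):
    """Average the per-pair IoU over a list of (pred, target) mask pairs."""
    return sum(iou(p, t) for p, t in pairs) / len(pairs)


# Two made-up 2 x 2 examples.
pairs = [
    ([[1, 1], [0, 0]], [[1, 0], [0, 0]]),  # intersection 1, union 2
    ([[0, 1], [0, 1]], [[0, 1], [0, 1]]),  # identical masks
]
print(mean_iou(pairs))
```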
