
PixDLM: A Dual-Path Multimodal Language Model for UAV Reasoning Segmentation

About

Reasoning segmentation has recently expanded from ground-level scenes to remote-sensing imagery, yet UAV data poses distinct challenges, including oblique viewpoints, ultra-high resolutions, and extreme scale variations. To address these issues, we formally define the UAV Reasoning Segmentation task and organize its semantic requirements into three dimensions: Spatial, Attribute, and Scene-level reasoning. Based on this formulation, we construct DRSeg, a large-scale benchmark for UAV reasoning segmentation, containing 10k high-resolution aerial images paired with Chain-of-Thought QA supervision across all three reasoning types. As a benchmark companion, we introduce PixDLM, a simple yet effective pixel-level multimodal language model that serves as a unified baseline for this task. Experiments on DRSeg establish strong baseline results and highlight the unique challenges of UAV reasoning segmentation, providing a solid foundation for future research.

Shuyan Ke, Yifan Mei, Changli Wu, Yonghan Zheng, Jiayi Ji, Liujuan Cao, Rongrong Ji • 2026

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Referring Segmentation | RefCOCO (val) | cIoU 75.2 | 84 |
| Referring Segmentation | RefCOCO (testA) | cIoU 80.2 | 83 |
| Referring Segmentation | RefCOCOg (val) | cIoU 73.3 | 72 |
| Referring Segmentation | RefCOCO+ (testA) | cIoU 0.733 | 60 |
| Referring Segmentation | RefCOCO (testB) | cIoU 70.5 | 54 |
| Referring Segmentation | RefCOCO+ (val) | cIoU 68.5 | 49 |
| Referring Segmentation | RefCOCOg (test) | cIoU 72.8 | 40 |
| Reasoning Segmentation | DRSeg | Attribute Reasoning gIoU 62.8 | 12 |
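
The benchmark results above are reported as cIoU and gIoU. The page does not define either metric, so the sketch below assumes the definitions commonly used in the referring and reasoning segmentation literature: gIoU as the mean of per-image IoUs, and cIoU as the cumulative intersection divided by the cumulative union over the whole dataset. The function names, the list-of-lists mask representation, and the handling of empty masks are illustrative assumptions, not the authors' evaluation code.

```python
from typing import List, Sequence, Tuple

# Assumption: a mask is a 2D grid of 0/1 values, where 1 marks foreground.
Mask = Sequence[Sequence[int]]


def intersection_and_union(pred: Mask, target: Mask) -> Tuple[int, int]:
    """Count the pixels in the intersection and union of two same-sized masks."""
    inter = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if p and t:
                inter += 1
            if p or t:
                union += 1
    return inter, union


def giou_ciou(preds: Sequence[Mask], targets: Sequence[Mask]) -> Tuple[float, float]:
    """Return (gIoU, cIoU) under the assumed definitions.

    gIoU: average of the per-image IoU values.
    cIoU: total intersection over total union across all images.
    """
    if not preds or len(preds) != len(targets):
        raise ValueError("preds and targets must be non-empty and equal in length")

    per_image_iou: List[float] = []
    total_inter = 0
    total_union = 0
    for pred, target in zip(preds, targets):
        inter, union = intersection_and_union(pred, target)
        total_inter += inter
        total_union += union
        # Assumption: an empty prediction on an empty target is a perfect match.
        per_image_iou.append(inter / union if union else 1.0)

    giou = sum(per_image_iou) / len(per_image_iou)
    ciou = total_inter / total_union if total_union else 1.0
    return giou, ciou


# Two toy 2x2 images: the first prediction over-segments, the second is exact.
example_preds = [[[1, 1], [0, 0]], [[1, 0], [0, 0]]]
example_targets = [[[1, 0], [0, 0]], [[1, 0], [0, 0]]]
example_giou, example_ciou = giou_ciou(example_preds, example_targets)
```

In this toy case the per-image IoUs are 0.5 and 1.0, so gIoU is 0.75, while cIoU is 2/3 because it pools pixels across images. Under these definitions cIoU is weighted toward large objects, which matters for UAV imagery with extreme scale variation.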
