PixDLM: A Dual-Path Multimodal Language Model for UAV Reasoning Segmentation
About
Reasoning segmentation has recently expanded from ground-level scenes to remote-sensing imagery, yet UAV data poses distinct challenges, including oblique viewpoints, ultra-high resolutions, and extreme scale variations. To address these issues, we formally define the UAV Reasoning Segmentation task and organize its semantic requirements into three dimensions: Spatial, Attribute, and Scene-level reasoning. Based on this formulation, we construct DRSeg, a large-scale benchmark for UAV reasoning segmentation, containing 10k high-resolution aerial images paired with Chain-of-Thought QA supervision across all three reasoning types. As a benchmark companion, we introduce PixDLM, a simple yet effective pixel-level multimodal language model that serves as a unified baseline for this task. Experiments on DRSeg establish strong baseline results and highlight the unique challenges of UAV reasoning segmentation, providing a solid foundation for future research.
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Referring Segmentation | RefCOCO (val) | cIoU | 75.2 | 84 |
| Referring Segmentation | RefCOCO (testA) | cIoU | 80.2 | 83 |
| Referring Segmentation | RefCOCOg (val) | cIoU | 73.3 | 72 |
| Referring Segmentation | RefCOCO+ (testA) | cIoU | 0.733 | 60 |
| Referring Segmentation | RefCOCO (testB) | cIoU | 70.5 | 54 |
| Referring Segmentation | RefCOCO+ (val) | cIoU | 68.5 | 49 |
| Referring Segmentation | RefCOCOg (test) | cIoU | 72.8 | 40 |
| Reasoning Segmentation | DRSeg | gIoU (Attribute Reasoning) | 62.8 | 12 |
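
The table reports cIoU and gIoU without defining them. The following is a minimal sketch of how these two metrics are conventionally computed in the reasoning-segmentation literature: gIoU averages the per-image IoU, while cIoU divides the intersection accumulated over all images by the accumulated union. It is not taken from the PixDLM or DRSeg evaluation code. The function names, the list-of-rows mask representation, and the choice to score an empty prediction against an empty target as 1.0 are assumptions made for illustration.

```python
from typing import Sequence, Tuple

# A binary mask as rows of 0/1 values (illustrative representation).
Mask = Sequence[Sequence[int]]


def intersection_and_union(pred: Mask, target: Mask) -> Tuple[int, int]:
    """Count the pixels in the intersection and union of two binary masks."""
    inter = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if p and t:
                inter += 1
            if p or t:
                union += 1
    return inter, union


def giou(preds: Sequence[Mask], targets: Sequence[Mask]) -> float:
    """Mean of per-image IoU values."""
    ious = []
    for pred, target in zip(preds, targets):
        inter, union = intersection_and_union(pred, target)
        # Assumption: an empty prediction on an empty target counts as 1.0.
        ious.append(1.0 if union == 0 else inter / union)
    return sum(ious) / len(ious)


def ciou(preds: Sequence[Mask], targets: Sequence[Mask]) -> float:
    """Cumulative intersection divided by cumulative union over all images."""
    total_inter = 0
    total_union = 0
    for pred, target in zip(preds, targets):
        inter, union = intersection_and_union(pred, target)
        total_inter += inter
        total_union += union
    # Same empty-mask assumption as in giou.
    return 1.0 if total_union == 0 else total_inter / total_union


# Two tiny 2x2 examples: the first prediction over-segments by one pixel,
# the second matches its target exactly.
example_preds = [[[1, 1], [0, 0]], [[1, 0], [0, 0]]]
example_targets = [[[1, 0], [0, 0]], [[1, 0], [0, 0]]]

print(f"gIoU = {giou(example_preds, example_targets):.3f}")
print(f"cIoU = {ciou(example_preds, example_targets):.3f}")
```

Because cIoU pools pixels across the whole dataset, it is dominated by large objects, whereas gIoU weights every image equally. That difference matters for UAV imagery, where targets vary widely in scale. Both functions return fractions in [0, 1]; leaderboard values such as 62.8 correspond to the fraction multiplied by 100.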