RemoteSAM: Towards Segment Anything for Earth Observation
About
We aim to develop a robust yet flexible visual foundation model for Earth observation. It should possess strong capabilities in recognizing and localizing diverse visual targets while providing compatibility with the various input-output interfaces required across different task scenarios. Current systems cannot meet these requirements, as they typically use task-specific architectures trained on narrow data domains with limited semantic coverage. Our study addresses these limitations from two aspects: data and modeling. We first introduce an automatic data engine that scales significantly better than previous human annotation or rule-based approaches. It has enabled us to create the largest dataset of its kind to date, comprising 270K image-text-mask triplets covering an unprecedented range of diverse semantic categories and attribute specifications. Based on this data foundation, we further propose a task unification paradigm that centers around referring expression segmentation. It effectively handles a wide range of vision-centric perception tasks, including classification, detection, segmentation, grounding, etc., using a single model without any task-specific heads. Combining these innovations on data and modeling, we present RemoteSAM, a foundation model that establishes new SoTA on several Earth observation perception benchmarks, outperforming other foundation models such as Falcon, GeoChat, and LHRS-Bot with significantly higher efficiency. Models and data are publicly available at https://github.com/1e12Leon/RemoteSAM.
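
To make the task unification idea concrete, the sketch below shows one plausible way a single referring-segmentation mask could be turned into outputs for other tasks: a bounding box for detection or grounding, and a presence flag for classification. The abstract does not describe the actual conversion rules, so the function names `mask_to_box` and `mask_to_presence` and the `min_pixels` threshold are illustrative assumptions, not RemoteSAM's implementation.

```python
import numpy as np


def mask_to_box(mask):
    """Return the tight (x_min, y_min, x_max, y_max) box around a binary mask.

    Hypothetical helper: returns None when the mask has no foreground pixels.
    """
    ys, xs = np.nonzero(mask)
    if ys.size == 0:
        return None
    return int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())


def mask_to_presence(mask, min_pixels=1):
    """Treat a category as present if its mask has at least `min_pixels` pixels.

    `min_pixels` is an assumed threshold chosen for illustration only.
    """
    return int(np.count_nonzero(mask)) >= min_pixels


# Toy mask standing in for the output of a referring expression such as "the airplane".
mask = np.zeros((6, 8), dtype=bool)
mask[2:4, 3:7] = True

box = mask_to_box(mask)          # box usable for detection or visual grounding
present = mask_to_presence(mask)  # flag usable for image-level classification
```

Under this reading, the segmentation mask is the only model output, and every other task interface is a deterministic post-processing step on it.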
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Semantic segmentation | Potsdam (test) | mIoU | 64.05 | 193 |
| Semantic segmentation | iSAID | mIoU | 64.72 | 146 |
| Referring Remote Sensing Image Segmentation | RRSIS-D (test) | Precision @ IoU 0.5 | 73.56 | 36 |
| Referring Remote Sensing Image Segmentation | RRSIS-D (val) | mIoU | 64.11 | 28 |
| Referring Remote Sensing Image Segmentation | RefSegRS (val) | Pr@0.5 | 96.29 | 23 |
| Referring Remote Sensing Image Segmentation | RefSegRS (test) | Pr@0.5 | 79.2 | 23 |
| Semantic segmentation | Potsdam | mF1 | 91.8 | 19 |
| Visual Grounding | VRS Bench (test) | mIoU | 62.83 | 16 |
| Visual Grounding | NWPU VHR-10 (test) | mIoU | 56.62 | 16 |
| Visual Grounding | VRSBench | mIoU | 62.83 | 15 |
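
For readers unfamiliar with the metrics in the table, the following is a minimal sketch of how mIoU and Pr@0.5 are commonly computed for referring segmentation. It assumes mIoU is the mean of per-sample IoU and that a sample counts toward Pr@0.5 when its IoU is at least 0.5; individual benchmarks may define these differently (for example, semantic segmentation mIoU is usually averaged over classes), so treat this as an illustration rather than the exact evaluation protocol used here.

```python
import numpy as np


def iou(pred, target):
    """Intersection over union of two binary masks of the same shape."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    union = np.logical_or(pred, target).sum()
    if union == 0:
        # Assumption: two empty masks are treated as a perfect match.
        return 1.0
    return float(np.logical_and(pred, target).sum() / union)


def mean_iou(preds, targets):
    """Mean of per-sample IoU over a set of predictions."""
    return float(np.mean([iou(p, t) for p, t in zip(preds, targets)]))


def precision_at(preds, targets, threshold=0.5):
    """Fraction of samples whose IoU reaches the threshold (Pr@threshold)."""
    scores = [iou(p, t) for p, t in zip(preds, targets)]
    return float(np.mean([s >= threshold for s in scores]))


# Two toy samples sharing the same ground-truth mask.
target = [[1, 0], [0, 0]]
preds = [[[1, 1], [0, 0]], [[1, 1], [1, 1]]]
targets = [target, target]

miou = mean_iou(preds, targets)
pr50 = precision_at(preds, targets, threshold=0.5)
```

The table reports these quantities as percentages, so a value such as 0.6283 from this sketch would correspond to 62.83.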