
Efficient Segment Anything with Depth-Aware Fusion and Limited Training Data

About

Segment Anything Models (SAM) achieve impressive universal segmentation performance but require massive datasets (e.g., 11M images) and rely solely on RGB inputs. Recent efficient variants reduce computation but still depend on large-scale training. We propose a lightweight RGB-D fusion framework that augments EfficientViT-SAM with monocular depth priors. Depth maps are generated with a pretrained monocular depth estimator, encoded by a dedicated depth encoder, and fused with the RGB features at an intermediate (mid-level) stage. Trained on only 11.2k samples (less than 0.1% of SA-1B), our method achieves higher accuracy than EfficientViT-SAM, showing that depth cues provide strong geometric priors for segmentation.
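To make the mid-level fusion idea concrete, here is a minimal NumPy sketch. The abstract only states that depth features from a dedicated encoder are fused with RGB features at an intermediate stage; the specific operator below (channel concatenation, a 1x1 projection, and a residual add), the function name `fuse_mid_level`, and all tensor shapes are assumptions made for illustration, not the authors' implementation.

```python
import numpy as np


def fuse_mid_level(rgb_feat, depth_feat, weight, bias):
    """Fuse depth features into RGB features at one intermediate stage.

    rgb_feat:   (C, H, W) features from the RGB backbone (hypothetical shape).
    depth_feat: (D, H, W) features from the depth encoder (hypothetical shape).
    weight:     (C, C + D) weights of a 1x1 projection (assumed fusion operator).
    bias:       (C,) projection bias.
    """
    if rgb_feat.shape[1:] != depth_feat.shape[1:]:
        raise ValueError("RGB and depth features must share spatial size")
    # Stack both modalities along the channel axis: (C + D, H, W).
    stacked = np.concatenate([rgb_feat, depth_feat], axis=0)
    # A 1x1 convolution is a per-pixel linear map over the channel axis.
    projected = np.einsum("oc,chw->ohw", weight, stacked) + bias[:, None, None]
    # Residual add: with zero projection weights the RGB features pass through unchanged.
    return rgb_feat + projected


rng = np.random.default_rng(0)
rgb = rng.standard_normal((8, 16, 16))    # stand-in for mid-level RGB features
depth = rng.standard_normal((4, 16, 16))  # stand-in for depth-encoder features
w = np.zeros((8, 12))                     # zero-initialised projection
b = np.zeros(8)
fused = fuse_mid_level(rgb, depth, w, b)
print(fused.shape)  # (8, 16, 16)
```

Zero-initialising the projection is one common way to add a new branch to a pretrained backbone without disturbing its behaviour at the start of training; whether the paper does this is not stated in the abstract.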

Yiming Zhou, Xuenjie Xie, Panfeng Li, Albrecht Kunz, Ahmad Osman, Xavier Maldague • 2026

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Point-Prompted Segmentation | COCO | mIoU 75.5 | 12 |
| Point-Prompted Segmentation | LVIS | mIoU 73.8 | 12 |
| Instance Segmentation | COCO 2017 (val) | Mean IoU 76.6 | 4 |
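The results above are reported as mean IoU (mIoU). The page does not describe the evaluation protocol, so the following is only a sketch of the common per-mask definition: intersection over union for each predicted/ground-truth mask pair, averaged over all pairs. The helper names `mask_iou` and `mean_iou`, and the convention of scoring two empty masks as 1.0, are assumptions for illustration.

```python
import numpy as np


def mask_iou(pred, target):
    """Intersection over union of two binary masks of equal shape."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    union = np.logical_or(pred, target).sum()
    if union == 0:
        # Both masks are empty; scoring this as a perfect match is an assumed convention.
        return 1.0
    intersection = np.logical_and(pred, target).sum()
    return float(intersection) / float(union)


def mean_iou(preds, targets):
    """Average IoU over paired predicted and ground-truth masks."""
    if len(preds) == 0 or len(preds) != len(targets):
        raise ValueError("need the same non-zero number of predictions and targets")
    return float(np.mean([mask_iou(p, t) for p, t in zip(preds, targets)]))


preds = [[[1, 1], [0, 0]], [[0, 1], [0, 1]]]
targets = [[[1, 0], [0, 0]], [[0, 1], [0, 1]]]
print(mean_iou(preds, targets))  # 0.75
```

In the example, the first pair overlaps on one pixel out of a two-pixel union (IoU 0.5) and the second pair matches exactly (IoU 1.0), giving a mean of 0.75. Benchmark scores such as 75.5 are this quantity expressed as a percentage.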
