Moondream Segmentation: From Words to Masks

About

We present Moondream Segmentation, a referring image segmentation extension of Moondream 3, a vision-language model. Given an image and a referring expression, the model autoregressively decodes a vector path and iteratively refines the rasterized mask into a final detailed mask. We introduce a reinforcement learning stage that resolves ambiguity in the supervised signal by directly optimizing mask quality. Rollouts from this stage produce coarse-to-ground-truth targets for the refiner. To mitigate evaluation noise from polygon annotations, we release RefCOCO-M, a cleaned RefCOCO validation split with boundary-accurate masks. Moondream Segmentation achieves a cIoU of 80.2% on RefCOCO (val) and 62.6% mIoU on LVIS (val).
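The two metrics quoted above aggregate differently, which affects how the numbers should be read. The sketch below assumes the definitions commonly used in referring segmentation work: cIoU as total intersection over total union across all samples, and mIoU as the average of per-sample IoU. The paper's exact evaluation protocol is not given here, and the function names and the empty-mask convention are illustrative assumptions.

```python
from typing import Sequence, Tuple

# A binary mask as rows of 0/1 pixels (illustrative representation).
Mask = Sequence[Sequence[int]]


def intersection_and_union(pred: Mask, target: Mask) -> Tuple[int, int]:
    """Count pixels in the intersection and union of two same-sized masks."""
    inter = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if p and t:
                inter += 1
            if p or t:
                union += 1
    return inter, union


def ciou(preds: Sequence[Mask], targets: Sequence[Mask]) -> float:
    """Cumulative IoU: sum of intersections divided by sum of unions."""
    total_inter = 0
    total_union = 0
    for pred, target in zip(preds, targets):
        inter, union = intersection_and_union(pred, target)
        total_inter += inter
        total_union += union
    return total_inter / total_union if total_union else 0.0


def miou(preds: Sequence[Mask], targets: Sequence[Mask]) -> float:
    """Mean IoU: average of per-sample IoU values."""
    scores = []
    for pred, target in zip(preds, targets):
        inter, union = intersection_and_union(pred, target)
        # Assumption: two empty masks count as a perfect match.
        scores.append(inter / union if union else 1.0)
    return sum(scores) / len(scores) if scores else 0.0


if __name__ == "__main__":
    # One small object predicted poorly, one larger object predicted exactly.
    preds = [[[1, 0]], [[1, 1, 1, 1], [1, 1, 1, 1]]]
    targets = [[[1, 1]], [[1, 1, 1, 1], [1, 1, 1, 1]]]
    print("cIoU:", ciou(preds, targets))
    print("mIoU:", miou(preds, targets))
```

Because cIoU pools pixels across samples, large objects dominate it, whereas mIoU weights every sample equally. In the toy case above, the poorly segmented small object lowers mIoU more than it lowers cIoU.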

Ethan Reid • 2026

Related benchmarks

Task                          Dataset                 Result     Rank
Instance Segmentation         LVIS (val)              --         46
Referring Image Segmentation  RefCOCOg Google (val)   --         15
Referring Image Segmentation  RefCOCO UNC (val)       --         10
Referring Image Segmentation  RefCOCO UNC (testA)     --         10
Referring Image Segmentation  RefCOCO UNC (testB)     --         10
Referring Image Segmentation  RefCOCO+ UNC (val)      cIoU 72.5  7
Referring Image Segmentation  RefCOCO-M (val)         cIoU 87.6  7
Referring Image Segmentation  RefCOCO+ UNC (testB)    cIoU 65.1  7
Referring Image Segmentation  RefCOCOg Google (test)  cIoU 73.9  7