Moondream Segmentation: From Words to Masks
About
We present Moondream Segmentation, a referring image segmentation extension of Moondream 3, a vision-language model. Given an image and a referring expression, the model autoregressively decodes a vector path and iteratively refines the rasterized mask into a final detailed mask. We introduce a reinforcement learning stage that resolves ambiguity in the supervised signal by directly optimizing mask quality. Rollouts from this stage produce coarse-to-ground-truth targets for the refiner. To mitigate evaluation noise from polygon annotations, we release RefCOCO-M, a cleaned RefCOCO validation split with boundary-accurate masks. Moondream Segmentation achieves a cIoU of 80.2% on RefCOCO (val) and 62.6% mIoU on LVIS (val).
Ethan Reid • 2026
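The abstract reports two different aggregate metrics, cIoU and mIoU. The sketch below assumes the definitions commonly used in referring segmentation: cIoU as total intersection over total union across the dataset, and mIoU as the average of per-sample IoU. The source does not spell out its exact evaluation protocol, so the function names, the list-of-rows mask format, and the empty-mask convention here are illustrative assumptions, not the authors' evaluation code.

```python
from typing import Iterable, Sequence, Tuple

# Assumption: a binary mask is a sequence of rows, each a sequence of 0/1 values.
Mask = Sequence[Sequence[int]]


def intersection_and_union(pred: Mask, gt: Mask) -> Tuple[int, int]:
    """Count pixels in the intersection and in the union of two binary masks."""
    inter = 0
    union = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            if p and g:
                inter += 1
            if p or g:
                union += 1
    return inter, union


def ciou(pairs: Iterable[Tuple[Mask, Mask]]) -> float:
    """Cumulative IoU: sum of intersections divided by sum of unions."""
    total_inter = 0
    total_union = 0
    for pred, gt in pairs:
        inter, union = intersection_and_union(pred, gt)
        total_inter += inter
        total_union += union
    return total_inter / total_union if total_union else 0.0


def miou(pairs: Iterable[Tuple[Mask, Mask]]) -> float:
    """Mean IoU: average of the per-sample IoU values."""
    scores = []
    for pred, gt in pairs:
        inter, union = intersection_and_union(pred, gt)
        # Assumption: two empty masks count as a perfect match.
        scores.append(inter / union if union else 1.0)
    return sum(scores) / len(scores) if scores else 0.0


# Toy data: one small object that is half right, one large object that is exact.
pairs = [
    ([[1, 0], [0, 0]], [[1, 1], [0, 0]]),
    ([[1, 1, 1, 1], [1, 1, 1, 1]], [[1, 1, 1, 1], [1, 1, 1, 1]]),
]
print(ciou(pairs), miou(pairs))
```

On this toy data cIoU is 9/10 and mIoU is 0.75: cIoU weights every pixel equally, so large objects dominate it, while mIoU weights every sample equally. This is one reason the two figures are not directly comparable across datasets.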
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Instance Segmentation | LVIS (val) | -- | 46 |
| Referring Image Segmentation | RefCOCOg Google (val) | -- | 15 |
| Referring Image Segmentation | RefCOCO UNC (val) | -- | 10 |
| Referring Image Segmentation | RefCOCO UNC (testA) | -- | 10 |
| Referring Image Segmentation | RefCOCO UNC (testB) | -- | 10 |
| Referring Image Segmentation | RefCOCO+ UNC (val) | cIoU 72.5 | 7 |
| Referring Image Segmentation | RefCOCO-M (val) | cIoU 87.6 | 7 |
| Referring Image Segmentation | RefCOCO+ UNC (testB) | cIoU 65.1 | 7 |
| Referring Image Segmentation | RefCOCOg Google (test) | cIoU 73.9 | 7 |