Learning to Label: A Reinforced Self-Evolving Framework for Semi-supervised Referring Expression Segmentation
About
Semi-supervised referring expression segmentation (SS-RES) aims to achieve precise pixel-level language grounding with few annotations, yet it suffers from scarce supervision and unreliable pseudo-labels when exploiting unlabeled image-text pairs. In this work, we propose Learning to Label (L2L), a reinforced self-evolving framework that casts pseudo-label construction as a learnable decision-making process. To build foundational understanding, we leverage a multimodal large language model to extract semantic-spatial priors, which are instantiated as initial soft segmentation proposals and elevated, together with textual cues, into learnable guidance signals that condition a hierarchical segmentation network. To ensure stable learning, we formulate reinforced pseudo-label selection as an exploratory decision process that adaptively rewards high-utility pixel-level supervision based on multimodal priors and model predictions. This reinforced self-evolving loop enables joint optimization of the segmentation model and the pseudo-labels, progressively enhancing label reliability under sparse supervision. Extensive experiments on RefCOCO, RefCOCO+, and RefCOCOg demonstrate improvements over existing methods, validating the framework's effectiveness and generalization.
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Referring Expression Segmentation | RefCOCO (testA) | -- | 315 |
| Referring Expression Segmentation | RefCOCO+ (testA) | -- | 288 |
| Referring Expression Segmentation | RefCOCO+ (val) | -- | 272 |
| Referring Expression Segmentation | RefCOCO (val) | -- | 261 |
| Referring Expression Segmentation | RefCOCO (testB) | -- | 259 |
| Referring Expression Segmentation | RefCOCO+ (testB) | -- | 256 |
| Referring Expression Segmentation | RefCOCOg (val (U)) | -- | 95 |
| Referring Expression Segmentation | RefCOCOg (test(U)) | -- | 78 |
| Referring Expression Segmentation | RefCOCOg UMD (val) | mIoU 60.63 | 52 |
| Referring Expression Segmentation | RefCOCOg UMD (test-u) | mIoU 62.66 | 46 |
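
The mIoU values in the table are not defined on this page. A minimal sketch of one common reading, assuming mIoU means the average of per-expression intersection-over-union between predicted and ground-truth binary masks, expressed as a percentage, is shown below. The function names `mask_iou` and `mean_iou` and the empty-mask convention are illustrative assumptions, not taken from the paper, whose evaluation protocol may differ (for example, some works report cumulative IoU over the whole split instead).

```python
from typing import Sequence

# Assumption: a binary mask is given as rows of 0/1 values.
Mask = Sequence[Sequence[int]]


def mask_iou(pred: Mask, target: Mask) -> float:
    """Intersection over union of two same-shaped binary masks."""
    inter = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            inter += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    # Assumed convention: two empty masks count as a perfect match.
    return inter / union if union else 1.0


def mean_iou(preds: Sequence[Mask], targets: Sequence[Mask]) -> float:
    """Average of per-sample IoU scores, returned as a percentage."""
    scores = [mask_iou(p, t) for p, t in zip(preds, targets)]
    return 100.0 * sum(scores) / len(scores)


if __name__ == "__main__":
    preds = [[[1, 1], [0, 0]], [[0, 1], [0, 1]]]
    targets = [[[1, 0], [0, 0]], [[0, 1], [0, 1]]]
    # First pair: intersection 1, union 2. Second pair: identical masks.
    print(mean_iou(preds, targets))
```

Under this reading, each referring expression contributes equally to the score regardless of object size, which is why it can differ from a cumulative IoU computed by pooling pixels across the split.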