ResAgent: Entropy-based Prior Point Discovery and Visual Reasoning for Referring Expression Segmentation
About
Referring Expression Segmentation (RES) is a core vision-language segmentation task that enables pixel-level understanding of targets via free-form linguistic expressions, supporting critical applications such as human-robot interaction and augmented reality. Despite the progress of Multimodal Large Language Model (MLLM)-based approaches, existing RES methods still suffer from two key limitations: first, the coarse bounding boxes produced by MLLMs lead to redundant or non-discriminative point prompts; second, the prevalent reliance on textual coordinate reasoning is unreliable, as it fails to distinguish targets from visually similar distractors. To address these issues, we propose **ResAgent**, a novel RES framework integrating **E**ntropy-**B**ased Point **D**iscovery (**EBD**) and **V**ision-**B**ased **R**easoning (**VBR**). Specifically, EBD identifies high-information candidate points by modeling spatial uncertainty within coarse bounding boxes, treating point selection as an information-maximization process. VBR verifies point correctness through joint visual-semantic alignment, replacing text-only coordinate inference with more robust validation. Built on these components, ResAgent implements a coarse-to-fine workflow: bounding-box initialization, entropy-guided point discovery, vision-based validation, and mask decoding. Extensive evaluations on four benchmark datasets (RefCOCO, RefCOCO+, RefCOCOg, and ReasonSeg) demonstrate that ResAgent achieves new state-of-the-art performance across all four benchmarks, highlighting its effectiveness in generating accurate and semantically grounded segmentation masks with minimal prompts.
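The entropy-guided point discovery step can be sketched as follows. This is a minimal illustrative example, not the paper's exact formulation: it assumes a per-pixel foreground-probability map is available (e.g., from a preliminary mask prediction), computes the binary entropy at each pixel, and selects the highest-entropy pixels inside the coarse bounding box as point prompts. The function names and the `k` parameter are hypothetical.

```python
import numpy as np

def entropy_map(prob: np.ndarray) -> np.ndarray:
    """Per-pixel binary entropy of a foreground-probability map."""
    p = np.clip(prob, 1e-6, 1 - 1e-6)  # avoid log(0)
    return -(p * np.log(p) + (1 - p) * np.log(1 - p))

def select_points(prob: np.ndarray, bbox: tuple, k: int = 3) -> list:
    """Pick the k highest-entropy pixels inside a coarse bounding box.

    prob: (H, W) foreground probabilities in [0, 1]
    bbox: (x0, y0, x1, y1) coarse box from the MLLM
    Returns a list of (x, y) point prompts in image coordinates.
    """
    x0, y0, x1, y1 = bbox
    region = entropy_map(prob)[y0:y1, x0:x1]
    # Indices of the k largest entropy values, in descending order.
    flat = np.argsort(region.ravel())[::-1][:k]
    ys, xs = np.unravel_index(flat, region.shape)
    return [(x0 + int(x), y0 + int(y)) for x, y in zip(xs, ys)]
```

In this sketch, pixels where the model is most uncertain (probability near 0.5) carry the most information, so prompting the mask decoder there disambiguates the target with the fewest points.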
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Referring Expression Segmentation | RefCOCO (testA) | -- | 217 |
| Referring Expression Segmentation | RefCOCO+ (val) | -- | 201 |
| Referring Expression Segmentation | RefCOCO (testB) | -- | 191 |
| Referring Expression Segmentation | RefCOCO (val) | -- | 190 |
| Referring Expression Segmentation | RefCOCO+ (testA) | -- | 190 |
| Referring Expression Segmentation | RefCOCO+ (testB) | -- | 188 |
| Reasoning Segmentation | ReasonSeg (val) | cIoU 68.38 | 145 |
| Referring Expression Segmentation | RefCOCOg (val (U)) | -- | 89 |
| Referring Expression Segmentation | RefCOCOg (test (U)) | -- | 78 |