Counterfactual Segmentation Reasoning: Diagnosing and Mitigating Pixel-Grounding Hallucination
About
Segmentation Vision-Language Models (VLMs) have significantly advanced grounded visual understanding, yet they remain prone to pixel-grounding hallucinations, producing masks for incorrect objects or for objects that are entirely absent. Existing evaluations rely almost entirely on text- or label-based perturbations, which check only whether the predicted mask matches the queried label. Such evaluations overlook the spatial footprint and severity of hallucination and therefore fail to reveal vision-driven hallucinations, which are more challenging and more prevalent. To address this gap, we formalize the task of Counterfactual Segmentation Reasoning (CSR), where a model must segment the referenced object in the factual image and abstain in its counterfactual counterpart. To support this task, we curate HalluSegBench, the first large-scale benchmark to diagnose referring and reasoning expression segmentation hallucinations using controlled visual counterfactuals, alongside new evaluation metrics that measure hallucination severity and disentangle vision- and language-driven failure modes. We further introduce RobustSeg, a segmentation VLM trained with counterfactual fine-tuning (CFT) to learn when to segment and when to abstain. Experimental results confirm RobustSeg reduces hallucinations by 30%, while improving segmentation performance on FP-RefCOCO(+/g).
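The CSR task described above pairs each factual image with a counterfactual one and asks for a mask in the first and abstention in the second. The following is a minimal sketch of that pass/fail logic, not the benchmark's actual metrics: the function names, the IoU threshold, and the "hallucinated fraction" measure are all hypothetical, and it assumes the counterfactual ground truth for the original query is an empty mask.

```python
from typing import Dict, List

Mask = List[List[int]]  # binary mask: 1 = pixel assigned to the queried object


def mask_area(mask: Mask) -> int:
    """Number of foreground pixels in a binary mask."""
    return sum(sum(row) for row in mask)


def iou(pred: Mask, gt: Mask) -> float:
    """Intersection over union of two same-sized binary masks."""
    inter = sum(p & g for pr, gr in zip(pred, gt) for p, g in zip(pr, gr))
    union = sum(p | g for pr, gr in zip(pred, gt) for p, g in zip(pr, gr))
    # Two empty masks agree perfectly; treat that case as IoU = 1.
    return inter / union if union else 1.0


def csr_check(
    pred_factual: Mask,
    gt_factual: Mask,
    pred_counterfactual: Mask,
    iou_threshold: float = 0.5,            # hypothetical threshold
    max_hallucinated_fraction: float = 0.0,  # hypothetical tolerance
) -> Dict[str, object]:
    """Illustrative CSR check: segment on the factual image, abstain on the counterfactual."""
    factual_iou = iou(pred_factual, gt_factual)
    # Assumption: the queried object is absent from the counterfactual image,
    # so every predicted foreground pixel there counts as hallucinated.
    total_pixels = sum(len(row) for row in pred_counterfactual)
    hallucinated_fraction = mask_area(pred_counterfactual) / total_pixels
    passed = (
        factual_iou >= iou_threshold
        and hallucinated_fraction <= max_hallucinated_fraction
    )
    return {
        "factual_iou": factual_iou,
        "hallucinated_fraction": hallucinated_fraction,
        "passed": passed,
    }
```

Reporting the hallucinated fraction rather than a binary "mask present" flag reflects the abstract's point that the spatial footprint of a hallucination matters, not only whether one occurred.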
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Referring Segmentation | FP-RefCOCO | Segment Score | 59.57 | 9 |
| Referring Segmentation | RefCOCOg FP | Segment Score | 54.76 | 9 |
| Reasoning Segmentation | HALLUSEGBENCH Reasoning | CMS Factual | 0.1541 | 9 |
| Referring Segmentation | HALLUSEGBENCH Referring | CMS Factual | 10.62 | 9 |
| Localization | FP-RefCOCO | See Score | 83.37 | 6 |
| Localization | FP-RefCOCO+ | See | 83 | 6 |
| Localization | FP-RefCOCOg | See | 84.21 | 6 |
| Segmentation | FP-RefCOCO+ | Segmentation Score | 52.91 | 6 |