Counterfactual Segmentation Reasoning: Diagnosing and Mitigating Pixel-Grounding Hallucination

About

Segmentation Vision-Language Models (VLMs) have significantly advanced grounded visual understanding, yet they remain prone to pixel-grounding hallucinations, producing masks for incorrect objects or for objects that are entirely absent. Existing evaluations rely almost entirely on text- or label-based perturbations, which check only whether the predicted mask matches the queried label. Such evaluations overlook the spatial footprint and severity of hallucination and therefore fail to reveal vision-driven hallucinations, which are more challenging and more prevalent. To address this gap, we formalize the task of Counterfactual Segmentation Reasoning (CSR), where a model must segment the referenced object in the factual image and abstain in its counterfactual counterpart. To support this task, we curate HalluSegBench, the first large-scale benchmark to diagnose referring and reasoning expression segmentation hallucinations using controlled visual counterfactuals, alongside new evaluation metrics that measure hallucination severity and disentangle vision- and language-driven failure modes. We further introduce RobustSeg, a segmentation VLM trained with counterfactual fine-tuning (CFT) to learn when to segment and when to abstain. Experimental results confirm RobustSeg reduces hallucinations by 30%, while improving segmentation performance on FP-RefCOCO(+/g).
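The CSR setup described above pairs each factual image with a counterfactual one in which the model should abstain. The following is a minimal sketch of how one such pair might be scored, assuming binary masks stored as nested lists. The function names and the two reported quantities (factual IoU and the fraction of counterfactual pixels that were segmented) are illustrative assumptions, not the benchmark's own metrics or code.

```python
from typing import Dict, List, Union

Mask = List[List[int]]  # binary mask: 1 = foreground pixel, 0 = background


def mask_area(mask: Mask) -> int:
    """Count the foreground pixels in a binary mask."""
    return sum(sum(row) for row in mask)


def iou(pred: Mask, target: Mask) -> float:
    """Intersection-over-union of two same-sized binary masks."""
    inter = sum(p & t for pr, tr in zip(pred, target) for p, t in zip(pr, tr))
    union = sum(p | t for pr, tr in zip(pred, target) for p, t in zip(pr, tr))
    # Two empty masks agree perfectly, so treat that case as IoU = 1.
    return 1.0 if union == 0 else inter / union


def csr_pair_report(
    pred_factual: Mask,
    gt_factual: Mask,
    pred_counterfactual: Mask,
) -> Dict[str, Union[float, bool]]:
    """Score one factual/counterfactual pair (hypothetical helper).

    In the counterfactual image the referenced object is absent, so any
    predicted foreground pixel counts as hallucinated in this sketch.
    """
    total_pixels = sum(len(row) for row in pred_counterfactual)
    hallucinated = mask_area(pred_counterfactual)
    return {
        "factual_iou": iou(pred_factual, gt_factual),
        "hallucinated_fraction": hallucinated / total_pixels,
        "abstained": hallucinated == 0,
    }


# Toy 3x3 example: the model misses one object pixel in the factual image
# and wrongly segments one pixel in the counterfactual image.
gt_factual = [[1, 1, 0], [1, 1, 0], [0, 0, 0]]
pred_factual = [[1, 1, 0], [1, 0, 0], [0, 0, 0]]
pred_counterfactual = [[0, 0, 0], [0, 1, 0], [0, 0, 0]]

report = csr_pair_report(pred_factual, gt_factual, pred_counterfactual)
print(report)
```

Reporting the hallucinated area as a fraction rather than a yes/no flag reflects the abstract's point that the spatial footprint of a hallucination matters, not only whether one occurred.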

Xinzhuo Li, Adheesh Juvekar, Jiaxun Zhang, Xingyou Liu, Muntasir Wahed, Kiet A. Nguyen, Yifan Shen, Tianjiao Yu, Ismini Lourentzou • 2025

Related benchmarks

Task	Dataset	Result
Referring Segmentation	FP-RefCOCO	Segment Score 59.57	9
Referring Segmentation	RefCOCOg FP	Segment Score 54.76	9
Reasoning Segmentation	HALLUSEGBENCH Reasoning	CMS Factual 0.1541	9
Referring Segmentation	HALLUSEGBENCH Referring	CMS Factual 10.62	9
Localization	FP-RefCOCO	See Score 83.37	6
Localization	FP-RefCOCO+	See 83	6
Localization	FP-RefCOCOg	See 84.21	6
Segmentation	FP-RefCOCO+	Segmentation Score 52.91	6

