Conversational Image Segmentation: Grounding Abstract Concepts with Scalable Supervision

About

Conversational image segmentation grounds abstract, intent-driven concepts into pixel-accurate masks. Prior work on referring image grounding focuses on categorical and spatial queries (e.g., "left-most apple") and overlooks functional and physical reasoning (e.g., "where can I safely store the knife?"). To address this gap, we introduce Conversational Image Segmentation (CIS) and ConverSeg, a benchmark spanning entities, spatial relations, intent, affordances, functions, safety, and physical reasoning. We also present ConverSeg-Net, which fuses strong segmentation priors with language understanding, and an AI-powered data engine that generates prompt-mask pairs without human supervision. We show that current language-guided segmentation models are inadequate for CIS, while ConverSeg-Net, trained on data generated by our engine, achieves significant gains on ConverSeg and maintains strong performance on existing language-guided segmentation benchmarks. Project webpage: https://glab-caltech.github.io/converseg/

Aadarsh Sahoo, Georgia Gkioxari • 2026

Related benchmarks

Task	Dataset	Result
Reasoning Segmentation	ReasonSeg (test)	gIoU 57	287
Referring Segmentation	RefCOCOg	cIoU 74.9	25
Affordance Grounding	ReasonAff (test)	gIoU 30.11	21
Affordance Grounding	UMD (test)	gIoU 33.27	18
Referring Segmentation	RefCOCO	cIoU 79.4	9
Reasoning Segmentation	Reasoning Segmentation (val)	gIoU 61.9	6
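
The table reports two metrics, gIoU and cIoU. As commonly defined for reasoning and referring segmentation, gIoU is the mean of per-image IoU scores, while cIoU is the cumulative intersection divided by the cumulative union over the whole dataset. The sketch below illustrates that difference under those assumed definitions; the function names and the handling of empty masks (scored as 1.0) are illustrative choices, not taken from the paper or its code.

```python
import numpy as np

def iou_terms(pred, gt):
    """Return (intersection, union) pixel counts for two binary masks."""
    pred = np.asarray(pred, dtype=bool)
    gt = np.asarray(gt, dtype=bool)
    inter = int(np.logical_and(pred, gt).sum())
    union = int(np.logical_or(pred, gt).sum())
    return inter, union

def giou(preds, gts):
    """Mean of per-image IoU (every image weighs the same)."""
    scores = []
    for pred, gt in zip(preds, gts):
        inter, union = iou_terms(pred, gt)
        # Assumption: two empty masks count as a perfect match.
        scores.append(inter / union if union > 0 else 1.0)
    return float(np.mean(scores))

def ciou(preds, gts):
    """Cumulative intersection over cumulative union (large masks weigh more)."""
    total_inter = 0
    total_union = 0
    for pred, gt in zip(preds, gts):
        inter, union = iou_terms(pred, gt)
        total_inter += inter
        total_union += union
    # Assumption: an entirely empty dataset scores 1.0.
    return total_inter / total_union if total_union > 0 else 1.0

# Two toy 2x2 examples: the first prediction over-segments, the second is exact.
preds = [[[1, 1], [0, 0]], [[1, 0], [0, 0]]]
gts = [[[1, 0], [0, 0]], [[1, 0], [0, 0]]]
g = giou(preds, gts)
c = ciou(preds, gts)
```

Because cIoU pools pixels before dividing, it is dominated by images with large masks, whereas gIoU treats every image equally; this is one reason the two numbers in a table like the one above are not directly comparable.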

