
GSVA: Generalized Segmentation via Multimodal Large Language Models

About

Generalized Referring Expression Segmentation (GRES) extends classic RES in two ways: a single expression may refer to multiple objects, and it may describe an empty target that is absent from the image. GRES therefore poses challenges in modeling the complex spatial relationships among instances in the image and in identifying non-existent referents. Multimodal Large Language Models (MLLMs) have recently shown tremendous progress in these complicated vision-language tasks. By connecting Large Language Models (LLMs) with vision models, MLLMs are proficient at understanding contexts with visual inputs. Among them, LISA, as a representative, adopts a special [SEG] token to prompt a segmentation mask decoder, e.g., SAM, which enables MLLMs to perform the RES task. However, existing solutions to GRES remain unsatisfactory, since current segmentation MLLMs cannot correctly handle cases where users reference multiple subjects in a single prompt or provide descriptions that match no target in the image. In this paper, we propose Generalized Segmentation Vision Assistant (GSVA) to address this gap. Specifically, GSVA reuses the [SEG] token to prompt the segmentation model to support multiple mask references simultaneously, and learns to generate a [REJ] token to explicitly reject null targets. Experiments validate GSVA's efficacy on GRES, setting a new record on the GRES benchmark dataset gRefCOCO. GSVA also proves effective across various classic referring segmentation and comprehension tasks.
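
To make the [SEG] / [REJ] idea concrete, here is a minimal sketch of how a decoded token sequence could be routed: each [SEG] position would prompt the mask decoder, while each [REJ] yields an empty mask without calling it. The function name `route_targets`, the example sentence, and the exact output layout are illustrative assumptions; the abstract does not specify GSVA's actual output format.

```python
from typing import List, Optional, Tuple

SEG, REJ = "[SEG]", "[REJ]"  # special tokens named in the abstract


def route_targets(tokens: List[str]) -> List[Tuple[int, Optional[int]]]:
    """Map each target in the output to a decoder prompt position.

    Returns (target_index, position) pairs. `position` is the index of a
    [SEG] token whose hidden state would prompt the mask decoder, or None
    for a [REJ] token, meaning the target is rejected (empty mask).
    Hypothetical helper, not the authors' implementation.
    """
    routes: List[Tuple[int, Optional[int]]] = []
    for pos, tok in enumerate(tokens):
        if tok == SEG:
            routes.append((len(routes), pos))
        elif tok == REJ:
            routes.append((len(routes), None))
    return routes


# Assumed output for an expression naming one present and one absent object.
example = ["the", "dog", ":", SEG, ",", "the", "unicorn", ":", REJ]
routes = route_targets(example)
```

Here `routes` contains one entry per referred target, so a single expression can produce several masks, and a rejected target never reaches the segmentation model.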

Zhuofan Xia, Dongchen Han, Yizeng Han, Xuran Pan, Shiji Song, Gao Huang • 2023

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Referring Expression Comprehension | RefCOCO+ (val) | -- | 345 |
| Referring Expression Comprehension | RefCOCO (val) | -- | 335 |
| Referring Expression Comprehension | RefCOCO (testA) | -- | 333 |
| Referring Expression Comprehension | RefCOCOg (val) | -- | 291 |
| Referring Expression Comprehension | RefCOCOg (test) | -- | 291 |
| Referring Expression Comprehension | RefCOCO+ (testB) | -- | 235 |
| Referring Expression Segmentation | RefCOCO (testA) | cIoU 81.7 | 217 |
| Referring Expression Comprehension | RefCOCO+ (testA) | -- | 207 |
| Referring Expression Segmentation | RefCOCO+ (val) | cIoU 70.3 | 201 |
| Referring Image Segmentation | RefCOCO+ (test-B) | mIoU 59.8 | 200 |
Showing 10 of 61 rows
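
The segmentation results above are reported as cIoU and mIoU, which this page does not define. In the referring segmentation literature, cIoU usually means cumulative IoU (total intersection over total union across the dataset) and mIoU the mean of per-sample IoU. The sketch below follows those common definitions on flattened binary masks; scoring an empty prediction against an empty ground truth as 1.0 is an assumption made here, and benchmark toolkits may handle that case differently.

```python
from typing import List, Sequence, Tuple

Mask = Sequence[int]  # flattened binary mask, 1 = foreground


def _inter_union(pred: Mask, gt: Mask) -> Tuple[int, int]:
    """Count intersecting and united foreground pixels of two masks."""
    inter = sum(1 for p, g in zip(pred, gt) if p and g)
    union = sum(1 for p, g in zip(pred, gt) if p or g)
    return inter, union


def ciou(preds: List[Mask], gts: List[Mask]) -> float:
    """Cumulative IoU: summed intersections over summed unions."""
    total_i = total_u = 0
    for pred, gt in zip(preds, gts):
        i, u = _inter_union(pred, gt)
        total_i += i
        total_u += u
    return total_i / total_u if total_u else 1.0  # assumption for all-empty data


def miou(preds: List[Mask], gts: List[Mask]) -> float:
    """Mean IoU: average of the per-sample IoU scores."""
    scores = []
    for pred, gt in zip(preds, gts):
        i, u = _inter_union(pred, gt)
        scores.append(i / u if u else 1.0)  # assumption: empty vs. empty is correct
    return sum(scores) / len(scores)


preds = [[1, 1, 0, 0], [1, 0, 0, 0]]
gts = [[1, 0, 0, 0], [1, 0, 0, 0]]
c, m = ciou(preds, gts), miou(preds, gts)
```

On this toy pair the two metrics differ (2/3 for cIoU, 0.75 for mIoU) because cIoU weights samples by their pixel area, so larger objects dominate it, while mIoU weights every sample equally.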
