HyperSeg: Towards Universal Visual Segmentation with Large Language Model

About

This paper addresses universal segmentation for image and video perception, drawing on the strong reasoning ability of Visual Large Language Models (VLLMs). Despite significant progress in unified segmentation methods, current approaches adapt poorly to both image and video scenarios and to complex reasoning segmentation, which makes it difficult for them to handle challenging instructions and to accurately capture fine-grained vision-language correlations. We propose HyperSeg, the first VLLM-based universal segmentation model for pixel-level image and video perception, encompassing generic segmentation tasks as well as more complex reasoning perception tasks that require powerful reasoning abilities and world knowledge. To fully leverage the recognition capabilities of VLLMs and fine-grained visual information, HyperSeg incorporates hybrid entity recognition and fine-grained visual perceiver modules for various segmentation tasks. Combined with a temporal adapter, HyperSeg achieves a comprehensive understanding of temporal information. Experimental results validate the effectiveness of our insights in resolving universal image and video segmentation tasks, including the more complex reasoning perception tasks. Our code is available.

Cong Wei, Yujie Zhong, Haoxian Tan, Yong Liu, Zheng Zhao, Jie Hu, Yujiu Yang • 2024

Related benchmarks

Task	Dataset	Result
Object Hallucination Evaluation	POPE	Accuracy 86.6	2056
Visual Question Answering	GQA	Accuracy 60.9	1445
Video Object Segmentation	DAVIS 2017 (val)	--	1251
Science Question Answering	ScienceQA	Accuracy 66.2	916
Reasoning Segmentation	ReasonSeg (val)	gIoU 64.9	382
Referring Expression Segmentation	RefCOCO (testA)	cIoU 85.7	332
Referring Expression Segmentation	RefCOCO+ (testA)	cIoU 83.5	305
Reasoning Segmentation	ReasonSeg (test)	gIoU 49.5	287
Referring Expression Segmentation	RefCOCO+ (val)	cIoU 79	284
Referring Expression Segmentation	RefCOCO (val)	cIoU 84.8	273

Showing 10 of 54 rows
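
The segmentation rows above report gIoU and cIoU. This page does not define them. In the reasoning and referring segmentation literature, gIoU usually means the average of per-image IoU scores, and cIoU usually means the cumulative intersection divided by the cumulative union over the whole dataset. The sketch below illustrates that difference under those assumed definitions. It is not the authors' evaluation code, the function names are hypothetical, and the handling of empty masks is an assumption.

```python
from typing import Sequence, Tuple

# A binary mask stored as rows of 0/1 values (illustrative representation).
Mask = Sequence[Sequence[int]]


def intersection_and_union(pred: Mask, target: Mask) -> Tuple[int, int]:
    """Count the pixels in the intersection and union of two binary masks."""
    inter = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if p and t:
                inter += 1
            if p or t:
                union += 1
    return inter, union


def giou(preds: Sequence[Mask], targets: Sequence[Mask]) -> float:
    """Average of per-image IoU: every image contributes equally."""
    scores = []
    for pred, target in zip(preds, targets):
        inter, union = intersection_and_union(pred, target)
        # Assumption: an empty prediction on an empty target scores 1.0.
        scores.append(inter / union if union else 1.0)
    return sum(scores) / len(scores)


def ciou(preds: Sequence[Mask], targets: Sequence[Mask]) -> float:
    """Cumulative intersection over cumulative union: large objects weigh more."""
    total_inter = 0
    total_union = 0
    for pred, target in zip(preds, targets):
        inter, union = intersection_and_union(pred, target)
        total_inter += inter
        total_union += union
    # Assumption: an all-empty dataset scores 1.0.
    return total_inter / total_union if total_union else 1.0


if __name__ == "__main__":
    # One small object that is half missed, one larger object matched exactly.
    preds = [[[1, 0]], [[1, 1, 1, 1]]]
    targets = [[[1, 1]], [[1, 1, 1, 1]]]
    print(f"gIoU = {giou(preds, targets):.4f}")
    print(f"cIoU = {ciou(preds, targets):.4f}")
```

Both functions return fractions in [0, 1], while the table reports percentages. Because cIoU pools pixels across images, it tends to favor datasets with large objects, which is one reason the two metrics are not directly comparable.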
