X-SAM: From Segment Anything to Any Segmentation

About

Large Language Models (LLMs) demonstrate strong capabilities in broad knowledge representation, yet they are inherently deficient in pixel-level perceptual understanding. Although the Segment Anything Model (SAM) represents a significant advancement in visual-prompt-driven image segmentation, it exhibits notable limitations in multi-mask prediction and category-specific segmentation tasks, and it cannot integrate all segmentation tasks within a unified model architecture. To address these limitations, we present X-SAM, a streamlined Multimodal Large Language Model (MLLM) framework that extends the segmentation paradigm from "segment anything" to "any segmentation". Specifically, we introduce a novel unified framework that enables more advanced pixel-level perceptual comprehension for MLLMs. Furthermore, we propose a new segmentation task, termed Visual GrounDed (VGD) segmentation, which segments all instance objects with interactive visual prompts and empowers MLLMs with visually grounded, pixel-wise interpretative capabilities. To enable effective training on diverse data sources, we present a unified training strategy that supports co-training across multiple datasets. Experimental results demonstrate that X-SAM achieves state-of-the-art performance on a wide range of image segmentation benchmarks, highlighting its efficiency for multimodal, pixel-level visual understanding. Code is available at https://github.com/wanghao9610/X-SAM.

Hao Wang, Limeng Qiao, Zequn Jie, Zhijian Huang, Chengjian Feng, Qingfang Zheng, Lin Ma, Xiangyuan Lan, Xiaodan Liang • 2025

Related benchmarks

Task	Dataset	Result
Reasoning Segmentation	ReasonSeg (val)	gIoU 56.6	382
Referring Expression Segmentation	RefCOCO (testA)	cIoU 87.1	332
Referring Expression Segmentation	RefCOCO+ (testA)	cIoU 81	305
Reasoning Segmentation	ReasonSeg (test)	gIoU 57.8	287
Referring Expression Segmentation	RefCOCO+ (val)	cIoU 78	284
Referring Expression Segmentation	RefCOCO (val)	cIoU 85.1	273
Referring Expression Segmentation	RefCOCO (testB)	cIoU 83.4	259
Referring Expression Segmentation	RefCOCO+ (testB)	cIoU 74.4	256
Referring Expression Segmentation	RefCOCOg (val)	cIoU 83.8	185
Referring Expression Segmentation	RefCOCOg (test)	cIoU 83.9	183

Showing 10 of 31 rows
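
The table reports two metrics, gIoU and cIoU, without defining them. As they are conventionally used in the referring and reasoning segmentation literature, gIoU is the mean of per-image IoU values, while cIoU is the cumulative intersection over the cumulative union across all images. The following is a minimal sketch of that difference in plain Python. The function names and the handling of empty masks are assumptions made for illustration, and X-SAM's own evaluation code may differ. The functions return fractions, whereas the table lists percentages.

```python
from typing import Sequence, Tuple

# A 2-D binary mask: 1 marks foreground pixels, 0 marks background.
Mask = Sequence[Sequence[int]]


def intersection_and_union(pred: Mask, target: Mask) -> Tuple[int, int]:
    """Count pixels in the intersection and union of two same-sized masks."""
    inter = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if p and t:
                inter += 1
            if p or t:
                union += 1
    return inter, union


def giou(preds: Sequence[Mask], targets: Sequence[Mask]) -> float:
    """Mean of per-image IoU (each image contributes equally)."""
    ious = []
    for pred, target in zip(preds, targets):
        inter, union = intersection_and_union(pred, target)
        # Assumption: empty prediction vs. empty target counts as a perfect match.
        ious.append(inter / union if union else 1.0)
    return sum(ious) / len(ious)


def ciou(preds: Sequence[Mask], targets: Sequence[Mask]) -> float:
    """Cumulative intersection over cumulative union (large objects weigh more)."""
    total_inter = 0
    total_union = 0
    for pred, target in zip(preds, targets):
        inter, union = intersection_and_union(pred, target)
        total_inter += inter
        total_union += union
    # Assumption: if every mask is empty, report a perfect score.
    return total_inter / total_union if total_union else 1.0


# Toy data: a half-correct small object and a fully correct larger one.
preds = [[[1, 0], [0, 0]], [[1, 1], [1, 1]]]
targets = [[[1, 1], [0, 0]], [[1, 1], [1, 1]]]

g = giou(preds, targets)  # (1/2 + 4/4) / 2
c = ciou(preds, targets)  # (1 + 4) / (2 + 4)
print(f"gIoU = {g:.4f}, cIoU = {c:.4f}")
```

On this toy data the two metrics diverge because cIoU is dominated by the larger, correctly segmented object, while gIoU weights both images equally. This is one reason the two numbers are not directly comparable across rows of the table.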
