SAM3-I: Segment Anything with Instructions

About

Segment Anything Model 3 (SAM3) advances open-vocabulary segmentation through promptable concept segmentation, enabling users to segment all instances associated with a given concept using short noun-phrase (NP) prompts. While effective for concept-level grounding, real-world interactions often involve far richer natural-language instructions that combine attributes, relations, actions, states, or implicit reasoning. Currently, SAM3 relies on external multi-modal agents to convert complex instructions into NPs and conducts iterative mask filtering, leading to coarse representations and limited instance specificity. In this work, we present SAM3-I, an instruction-following extension of the SAM family that unifies concept-level grounding and instruction-level reasoning within a single segmentation framework. Built upon SAM3, SAM3-I introduces an instruction-aware cascaded adaptation mechanism with dedicated alignment losses that progressively aligns expressive instruction semantics with SAM3's vision-language representations, enabling direct interpretation of natural-language instructions while preserving its strong concept recall ability. To enable instruction-following learning, we introduce HMPL-Instruct, a large-scale instruction-centric dataset that systematically covers hierarchical instruction semantics and diverse target granularities. Experiments demonstrate that SAM3-I achieves appealing performance across referring and reasoning-based segmentation, showing that SAM3 can be effectively extended to follow complex natural-language instructions without sacrificing its original concept-driven strengths. Code and dataset are available at https://github.com/debby-0527/SAM3-I.

Jingjing Li, Yue Feng, Yuchen Guo, Jincai Huang, Wei Ji, Qi Bi, Yongri Piao, Miao Zhang, Xiaoqi Zhao, Qiang Chen, Shihao Zou, Huchuan Lu, Li Cheng • 2025

Related benchmarks

Task	Dataset	Metric	Result
Referring Expression Segmentation	RefCOCO+ (val)	cIoU	29.44	284
Referring Expression Segmentation	RefCOCO (val)	cIoU	39.4	273
Referring Expression Segmentation	RefCOCOg (val)	cIoU	32.76	185
Reasoning Segmentation	Intent2Part InstructPart	mIoU	68.9	9
Referring Segmentation	Intent2Part InstructPart	mIoU	70.1	9
Intent-level Segmentation	Intent2Part clean (test)	mIoU	34.2	9
Intent-level Segmentation	Intent2Part (test)	mIoU	31.8	9
Affordance Grounding	A2A-Bench Multi-instance 1.0 (test)	sIoU	46.82	8
Affordance Grounding	A2A-Bench Single-instance 1.0 (test)	gIoU	44.58	8
TRPS affordance grounding	A2A-Bench	gIoU	50.99	4

Showing 10 of 13 rows
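
The benchmark table reports cIoU and gIoU without defining them. In referring and reasoning segmentation work, cIoU usually means cumulative intersection over cumulative union across the whole evaluation set, and gIoU usually means the average of per-image IoUs. Whether each benchmark listed here uses exactly these definitions is an assumption, and the function names below are illustrative rather than taken from the SAM3-I code. A minimal sketch of the difference:

```python
from typing import List, Sequence, Tuple

Mask = Sequence[Sequence[int]]  # binary mask as rows of 0/1 values


def intersection_union(pred: Mask, gt: Mask) -> Tuple[int, int]:
    """Count intersecting and union pixels of two same-sized binary masks."""
    inter = 0
    union = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            if p and g:
                inter += 1
            if p or g:
                union += 1
    return inter, union


def ciou(preds: List[Mask], gts: List[Mask]) -> float:
    """Cumulative IoU: sum intersections and unions over all samples, then divide."""
    total_inter = 0
    total_union = 0
    for pred, gt in zip(preds, gts):
        inter, union = intersection_union(pred, gt)
        total_inter += inter
        total_union += union
    # Assumption: an evaluation set with no foreground at all counts as perfect.
    return total_inter / total_union if total_union else 1.0


def giou(preds: List[Mask], gts: List[Mask]) -> float:
    """Mean of per-sample IoUs (the 'gIoU' convention assumed here)."""
    if not preds:
        raise ValueError("at least one prediction is required")
    scores = []
    for pred, gt in zip(preds, gts):
        inter, union = intersection_union(pred, gt)
        # Assumption: empty prediction vs. empty ground truth counts as IoU 1.0.
        scores.append(inter / union if union else 1.0)
    return sum(scores) / len(scores)


# Two toy samples: one half-correct small mask, one fully correct large mask.
preds = [[[1, 1], [0, 0]], [[1, 1], [1, 1]]]
gts = [[[1, 0], [0, 0]], [[1, 1], [1, 1]]]
print(f"cIoU = {ciou(preds, gts):.4f}")  # 5 / 6
print(f"gIoU = {giou(preds, gts):.4f}")  # (0.5 + 1.0) / 2
```

Because cIoU pools pixels before dividing, large objects dominate it, while the per-sample average weights every sample equally. That is one reason the two metrics are not directly comparable across rows of the table.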
