SAM3-I: Segment Anything with Instructions
About
Segment Anything Model 3 (SAM3) advances open-vocabulary segmentation through promptable concept segmentation, enabling users to segment all instances associated with a given concept using short noun-phrase (NP) prompts. While effective for concept-level grounding, real-world interactions often involve far richer natural-language instructions that combine attributes, relations, actions, states, or implicit reasoning. Currently, SAM3 relies on external multi-modal agents to convert complex instructions into NPs and conducts iterative mask filtering, leading to coarse representations and limited instance specificity. In this work, we present SAM3-I, an instruction-following extension of the SAM family that unifies concept-level grounding and instruction-level reasoning within a single segmentation framework. Built upon SAM3, SAM3-I introduces an instruction-aware cascaded adaptation mechanism with dedicated alignment losses that progressively aligns expressive instruction semantics with SAM3's vision-language representations, enabling direct interpretation of natural-language instructions while preserving its strong concept recall ability. To enable instruction-following learning, we introduce HMPL-Instruct, a large-scale instruction-centric dataset that systematically covers hierarchical instruction semantics and diverse target granularities. Experiments demonstrate that SAM3-I achieves appealing performance across referring and reasoning-based segmentation, showing that SAM3 can be effectively extended to follow complex natural-language instructions without sacrificing its original concept-driven strengths. Code and dataset are available at https://github.com/debby-0527/SAM3-I.
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Reasoning Segmentation | Intent2Part InstructPart | mIoU | 68.9 | 9 |
| Referring Segmentation | Intent2Part InstructPart | mIoU | 70.1 | 9 |
| Intent-level Segmentation | Intent2Part clean (test) | mIoU | 34.2 | 9 |
| Intent-level Segmentation | Intent2Part (test) | mIoU | 31.8 | 9 |
| Complex Instruction-Following Segmentation | PACO-LVIS-Instruct Complex (test) | gIoU | 51 | 3 |
| Simple Instruction-Following Segmentation | PACO-LVIS-Instruct Simple Instruct. (test) | gIoU | 54 | 3 |
| Concept-level Grounding | PACO-LVIS-Instruct Concept-level (test) | gIoU | 48.9 | 2 |
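
The mIoU and gIoU figures in the table above are intersection-over-union scores averaged over a test set. This page does not state the exact evaluation protocol of each benchmark, so the following is only a minimal sketch under one common assumption: the reported score is the mean of per-image IoUs between predicted and ground-truth binary masks. The function names `iou` and `mean_iou` and the handling of empty masks are illustrative choices, not taken from the SAM3-I code.

```python
def iou(pred, target):
    """IoU between two binary masks given as equal-shaped nested lists of 0/1."""
    inter = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            inter += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if union == 0 else inter / union


def mean_iou(preds, targets):
    """Average of per-image IoUs (the assumed meaning of the reported scores)."""
    scores = [iou(p, t) for p, t in zip(preds, targets)]
    if not scores:
        raise ValueError("at least one mask pair is required")
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Two toy 2x2 examples: one partial overlap, one exact match.
    preds = [[[1, 1], [0, 0]], [[0, 1], [0, 1]]]
    targets = [[[1, 0], [0, 0]], [[0, 1], [0, 1]]]
    print(mean_iou(preds, targets))
```

Averaging per image weights every example equally regardless of object size; a cumulative variant that sums intersections and unions over the whole dataset before dividing would instead favor large objects.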