SAM3-DMS: Decoupled Memory Selection for Multi-target Video Segmentation of SAM3
About
Segment Anything 3 (SAM3) provides a strong foundation for robustly detecting, segmenting, and tracking specified targets in videos. However, its original implementation selects memory at the group level: it makes a single synchronized decision for all concurrent targets, conditioned on their average performance. This often overlooks the reliability of individual targets and is suboptimal in complex multi-object scenarios. To address this, we propose SAM3-DMS, a training-free decoupled strategy that performs fine-grained memory selection for each object individually. Experiments demonstrate that our approach achieves robust identity preservation and tracking stability. Notably, the advantage becomes more pronounced as target density increases, establishing a solid foundation for simultaneous multi-target video segmentation in the wild.
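The contrast between group-level and decoupled memory selection can be illustrated with a minimal sketch. It is not the SAM3 or SAM3-DMS implementation. The per-object reliability scores, the fixed threshold, and the function names `group_level_selection` and `decoupled_selection` are all assumptions made for illustration; the abstract does not specify how reliability is measured or thresholded.

```python
from typing import Dict


def group_level_selection(scores: Dict[int, float], threshold: float) -> Dict[int, bool]:
    """Hypothetical group-level rule: one shared decision from the mean score."""
    if not scores:
        return {}
    mean_score = sum(scores.values()) / len(scores)
    decision = mean_score >= threshold
    # Every object receives the same verdict, regardless of its own score.
    return {obj_id: decision for obj_id in scores}


def decoupled_selection(scores: Dict[int, float], threshold: float) -> Dict[int, bool]:
    """Hypothetical decoupled rule: each object is judged on its own score."""
    return {obj_id: score >= threshold for obj_id, score in scores.items()}


# Assumed per-object reliability scores for one frame (illustrative values only).
frame_scores = {1: 0.95, 2: 0.90, 3: 0.20}
THRESHOLD = 0.5  # assumed cut-off, not taken from the paper

group_result = group_level_selection(frame_scores, THRESHOLD)
decoupled_result = decoupled_selection(frame_scores, THRESHOLD)

print("group-level:", group_result)
print("decoupled:  ", decoupled_result)
```

In this toy frame the two reliable objects pull the average above the threshold, so the group-level rule also writes the unreliable object 3 into memory, while the decoupled rule rejects it alone. The more targets share a frame, the more often such disagreements can occur, which is consistent with the abstract's claim that the benefit grows with target density.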
Related benchmarks
| Task | Dataset | Metric | Score | Rank |
|---|---|---|---|---|
| Video Object Segmentation | SA-V (val) | J&F | 83.3 | 74 |
| Promptable Video Segmentation | SA-V (test) | J&F | 84.3 | 4 |
| Promptable Video Segmentation | MOSE v2 (val) | J&F | 60.3 | 4 |
| Promptable Concept Segmentation | SA-V (val) | cgF1 | 29.4 | 3 |
| Promptable Concept Segmentation | YT-Temporal-1B (val) | cgF1 | 50.3 | 3 |
| Promptable Concept Segmentation | YT-Temporal-1B (test) | cgF1 | 51 | 3 |
| Promptable Concept Segmentation | SmartGlasses (val) | cgF1 | 33.6 | 3 |
| Promptable Concept Segmentation | SmartGlasses (test) | cgF1 | 36.5 | 3 |
| Promptable Concept Segmentation | SA-V (test) | cgF1 | 30.3 | 3 |