SAM 2++: Tracking Anything at Any Granularity
About
Because the granularity of target states varies across tasks, most existing trackers are tailored to a single task. This specificity limits their generalization, prevents them from effectively using multi-task training data, and leads to redundancy in both model design and parameters. Although recent unified vision models share parts of their architecture across tasks, they usually retain task-specific interfaces and overlook the common tracking principle behind different granularities, leaving a gap for truly unified video tracking. To unify video tracking tasks, we present SAM 2++, a unified framework that handles target states at different granularities, including masks, boxes, and points, through an integrated design of prompt encoding, output decoding, and memory representation. First, to handle different target granularities, we design task-specific prompts that map diverse task inputs into general prompt embeddings, together with a Unified Decoder that produces task results in a common output form without redesigning the overall pipeline. Next, to support memory matching, the core operation of tracking, we introduce a task-adaptive memory mechanism that unifies memory across different granularities while preserving their distinct state semantics, so that full parameter sharing does not cause interference across granularities. Finally, we introduce Tracking-Any-Granularity, the first large and diverse video tracking dataset with rich annotations at three granularities. It is constructed through a customized data engine with phased manual annotation and model-assisted completion, providing a comprehensive resource for training, benchmarking, and analyzing unified tracking models. Comprehensive experiments confirm that SAM 2++ sets a new state of the art across diverse tracking tasks at different granularities, establishing a unified and robust tracking framework.
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Video Object Segmentation | DAVIS 2017 (val) | J mean | 86.3 | 1226 |
| Visual Object Tracking | GOT-10k (test) | Average Overlap | 80.7 | 450 |
| Video Object Segmentation | YouTube-VOS 2019 (val) | J-Score (Seen) | 85.8 | 240 |
| Single Object Tracking | TrackingNet | Pnorm | 90.1 | 72 |
| Video Object Segmentation | LVOS v2 (val) | J&F | 82.2 | 63 |
| Video Object Segmentation | MOSE (val) | J&F Score | 74.6 | 54 |
| Single Object Tracking | VastTrack | AUC | 55 | 29 |
| Visual Object Tracking | TrackingNet | Success Rate (AUC) | 86 | 25 |
| Single Object Tracking | TAG (test) | AUC | 78 | 17 |
| Single Object Tracking | TAG (val) | AUC | 78.2 | 17 |
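
Two of the metrics in the table are built on intersection-over-union (IoU): J (region similarity) is the IoU between a predicted and a ground-truth mask, and Average Overlap is the mean IoU between predicted and ground-truth boxes over the frames of a sequence. The sketch below illustrates both for a single object. It is not the official evaluation code of any benchmark: the function names are hypothetical, boxes are assumed to be `(x, y, w, h)`, masks are assumed to be nested lists of 0/1, and scoring two empty masks as 1.0 is a convention assumed here. Benchmark toolkits additionally average over objects and sequences and report percentages.

```python
from typing import Sequence

Box = Sequence[float]            # assumed layout: (x, y, w, h)
Mask = Sequence[Sequence[int]]   # assumed layout: rows of 0/1 pixels


def box_iou(a: Box, b: Box) -> float:
    """IoU of two axis-aligned boxes given as (x, y, w, h)."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    # Width and height of the overlap rectangle, clamped at zero.
    iw = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    ih = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = iw * ih
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


def average_overlap(preds: Sequence[Box], gts: Sequence[Box]) -> float:
    """Mean per-frame box IoU over one sequence."""
    if len(preds) != len(gts):
        raise ValueError("preds and gts must have the same number of frames")
    if not preds:
        return 0.0
    return sum(box_iou(p, g) for p, g in zip(preds, gts)) / len(preds)


def mask_iou(pred: Mask, gt: Mask) -> float:
    """Region similarity J for one frame: IoU of two binary masks."""
    inter = 0
    union = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            inter += 1 if (p and g) else 0
            union += 1 if (p or g) else 0
    # Assumed convention: two empty masks count as a perfect match.
    return inter / union if union else 1.0
```

For example, two 2×2 boxes offset by one pixel in each direction overlap in a single unit square, so `box_iou((0, 0, 2, 2), (1, 1, 2, 2))` is 1/7; a table entry such as 80.7 corresponds to a mean value of 0.807 on this scale.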