SAM 2++: Tracking Anything at Any Granularity

About

Because the granularity of target states varies across tasks, most existing trackers are tailored to a single task. This specificity limits their generalization, prevents them from effectively using multi-task training data, and leads to redundancy in both model design and parameters. Although recent unified vision models share partial architectures across tasks, they usually retain task-specific interfaces and overlook the common tracking principle behind different granularities, leaving a gap for truly unified video tracking. To unify video tracking tasks, we present SAM 2++, a unified framework that can handle target states at different granularities, including masks, boxes, and points, through an integrated design of prompt encoding, output decoding, and memory representation. First, to handle different target granularities, we design task-specific prompts that map diverse task inputs into general prompt embeddings, together with a Unified Decoder that produces task results in a common output form without redesigning the overall pipeline. Next, to support memory matching, the core operation of tracking, we introduce a task-adaptive memory mechanism that unifies memory across different granularities while preserving their distinct state semantics, so that full parameter sharing does not cause interference across granularities. Finally, we introduce Tracking-Any-Granularity, the first large and diverse video tracking dataset with rich annotations at three granularities. It is constructed through a customized data engine with phased manual annotation and model-assisted completion, providing a comprehensive resource for training, benchmarking, and analyzing unified tracking models. Comprehensive experiments confirm that SAM 2++ sets a new state of the art across diverse tracking tasks at different granularities, establishing a unified and robust tracking framework.

Jiaming Zhang, Cheng Liang, Yichun Yang, Chenkai Zeng, Yutao Cui, Xinwen Zhang, Xin Zhou, Kai Ma, Gangshan Wu, Limin Wang • 2025

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Video Object Segmentation | DAVIS 2017 (val) | J mean | 86.3 | 1226 |
| Visual Object Tracking | GOT-10k (test) | Average Overlap | 80.7 | 450 |
| Video Object Segmentation | YouTube-VOS 2019 (val) | J-Score (Seen) | 85.8 | 240 |
| Single Object Tracking | TrackingNet | Pnorm | 90.1 | 72 |
| Video Object Segmentation | LVOS v2 (val) | J&F | 82.2 | 63 |
| Video Object Segmentation | MOSE (val) | J&F Score | 74.6 | 54 |
| Single Object Tracking | VastTrack | AUC | 55 | 29 |
| Visual Object Tracking | TrackingNet | Success Rate (AUC) | 86 | 25 |
| Single Object Tracking | TAG (test) | AUC | 78 | 17 |
| Single Object Tracking | TAG (val) | AUC | 78.2 | 17 |
Showing 10 of 18 rows
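
Several metrics in the table are overlap scores. As a rough illustration only, the sketch below computes the Jaccard index (the region similarity J used in video object segmentation) for binary masks, and intersection over union for boxes, which is then averaged over frames in the spirit of Average Overlap. The function names, the `(x_min, y_min, x_max, y_max)` box convention, and the handling of empty masks are assumptions made for this example. Official benchmark toolkits apply their own conventions, so this should not be read as the evaluation code behind the numbers above.

```python
from typing import Sequence, Tuple

# Assumed box convention for this sketch: (x_min, y_min, x_max, y_max).
Box = Tuple[float, float, float, float]


def mask_iou(pred: Sequence[Sequence[int]], gt: Sequence[Sequence[int]]) -> float:
    """Jaccard index (intersection over union) of two equally sized binary masks."""
    inter = 0
    union = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            if p and g:
                inter += 1
            if p or g:
                union += 1
    # Assumption: two empty masks are treated as a perfect match.
    return 1.0 if union == 0 else inter / union


def box_iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return 0.0 if union <= 0 else inter / union


def mean_over_frames(per_frame_scores: Sequence[float]) -> float:
    """Average a per-frame score over a sequence (a fraction in [0, 1])."""
    return sum(per_frame_scores) / len(per_frame_scores)
```

For example, `mean_over_frames([box_iou(p, g) for p, g in zip(pred_boxes, gt_boxes)])` gives a per-sequence average overlap as a fraction, where `pred_boxes` and `gt_boxes` are hypothetical per-frame box lists. The table values appear to be on a 0 to 100 scale.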
