SAM 2++: Tracking Anything at Any Granularity

About

Due to the varying granularity of target states across different tasks, most existing trackers are tailored to a single task. This specificity limits their generalization, prevents them from effectively utilizing multi-task training data, and leads to redundancy in both model design and parameters. Although recent unified vision models share partial architectures across tasks, they usually retain task-specific interfaces and overlook the common tracking principle behind different granularities, leaving a gap for truly unified video tracking. To unify video tracking tasks, we present SAM 2++, a unified framework that can handle target states at different granularities, including masks, boxes, and points, through an integrated design of prompt encoding, output decoding, and memory representation. First, to handle different target granularities, we design task-specific prompts that map diverse task inputs into general prompt embeddings, together with a Unified Decoder that produces task results in a common output form without redesigning the overall pipeline. Next, to support memory matching, the core operation of tracking, we introduce a task-adaptive memory mechanism that unifies memory across different granularities while preserving their distinct state semantics, so that full parameter sharing does not cause interference across granularities. Finally, we introduce Tracking-Any-Granularity, the first large and diverse video tracking dataset with rich annotations at three granularities. It is constructed through a customized data engine with phased manual annotation and model-assisted completion, providing a comprehensive resource for training, benchmarking, and analyzing unified tracking models. Comprehensive experiments confirm that SAM 2++ sets a new state of the art across diverse tracking tasks at different granularities, establishing a unified and robust tracking framework.

Jiaming Zhang, Cheng Liang, Yichun Yang, Chenkai Zeng, Yutao Cui, Xinwen Zhang, Xin Zhou, Kai Ma, Gangshan Wu, Limin Wang • 2025

Related benchmarks

Task	Dataset	Metric	Result
Video Object Segmentation	DAVIS 2017 (val)	J mean	86.3	1251
Visual Object Tracking	GOT-10k (test)	Average Overlap	80.7	461
Video Object Segmentation	YouTube-VOS 2019 (val)	J-Score (Seen)	85.8	240
Single Object Tracking	TrackingNet	Pnorm	90.1	84
Visual Object Tracking	TrackingNet	Success Rate (AUC)	86	64
Video Object Segmentation	MOSE (val)	J&F Score	74.6	64
Video Object Segmentation	LVOS v2 (val)	J&F	82.2	63
Single Object Tracking	VastTrack	AUC	55	29
Single Object Tracking	TAG (test)	AUC	78	17
Single Object Tracking	TAG (val)	AUC	78.2	17

Showing 10 of 18 rows
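
Several metrics in the table above are overlap-based: J is the Jaccard index (intersection over union) between predicted and ground-truth masks, and Average Overlap is the mean box IoU over frames. The following is a minimal sketch of those two quantities, not the official evaluation code of any benchmark. The function names, the (x1, y1, x2, y2) box format, and the choice to score two empty masks as 1.0 are assumptions made for illustration; each benchmark defines its own protocol for which frames count and how sequences are averaged.

```python
from typing import Sequence, Tuple

# Assumed box format for this sketch: (x1, y1, x2, y2) corner coordinates.
Box = Tuple[float, float, float, float]


def mask_iou(pred: Sequence[Sequence[int]], gt: Sequence[Sequence[int]]) -> float:
    """Jaccard index between two binary masks given as nested lists of 0/1."""
    inter = 0
    union = 0
    for row_p, row_g in zip(pred, gt):
        for p, g in zip(row_p, row_g):
            inter += 1 if (p and g) else 0
            union += 1 if (p or g) else 0
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if union == 0 else inter / union


def box_iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return 0.0 if union <= 0 else inter / union


def mean_score(scores: Sequence[float]) -> float:
    """Average per-frame scores, e.g. to get a mean J or an average overlap."""
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Toy data, not taken from any benchmark.
    print(mask_iou([[1, 1], [0, 0]], [[1, 0], [0, 0]]))
    print(box_iou((0, 0, 2, 2), (1, 1, 3, 3)))
```

Leaderboards usually report these scores multiplied by 100, so a per-frame mean of 0.863 would appear as 86.3.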