DINO-Tracker: Taming DINO for Self-Supervised Point Tracking in a Single Video

About

We present DINO-Tracker -- a new framework for long-term dense tracking in video. The pillar of our approach is combining test-time training on a single video, with the powerful localized semantic features learned by a pre-trained DINO-ViT model. Specifically, our framework simultaneously adopts DINO's features to fit to the motion observations of the test video, while training a tracker that directly leverages the refined features. The entire framework is trained end-to-end using a combination of self-supervised losses, and regularization that allows us to retain and benefit from DINO's semantic prior. Extensive evaluation demonstrates that our method achieves state-of-the-art results on known benchmarks. DINO-tracker significantly outperforms self-supervised methods and is competitive with state-of-the-art supervised trackers, while outperforming them in challenging cases of tracking under long-term occlusions.

Narek Tumanyan, Assaf Singer, Shai Bagon, Tali Dekel• 2024

Related benchmarks

Task	Dataset	Result
Point Tracking	TAP-Vid DAVIS (Strided)	Avg Delta Error78.2	33
Video Tracking	BADJA	delta_seg14.3	15
Long-term Point Tracking	TAP-Vid DAVIS 480p (test)	Avg Temporal Error73.2	12
Point Tracking	TAP-Vid DAVIS-480	Avg Displacement Error (x)80.4	9
Video Tracking	DAVIS 480	Delta Avg80.4	6

Showing 5 of 5 rows

Other info

Follow for update

@wizwand_team Discord