BootsTAP: Bootstrapped Training for Tracking-Any-Point

About

To endow models with greater understanding of physics and motion, it is useful to enable them to perceive how solid surfaces move and deform in real scenes. This can be formalized as Tracking-Any-Point (TAP), which requires the algorithm to track any point on solid surfaces in a video, potentially densely in space and time. Large-scale ground-truth training data for TAP is only available in simulation, which currently has a limited variety of objects and motion. In this work, we demonstrate how large-scale, unlabeled, uncurated real-world data can improve a TAP model with minimal architectural changes, using a self-supervised student-teacher setup. We demonstrate state-of-the-art performance on the TAP-Vid benchmark, surpassing previous results by a wide margin: for example, TAP-Vid-DAVIS performance improves from 61.3% to 67.4%, and TAP-Vid-Kinetics from 57.2% to 62.5%. For visualizations, see our project webpage at https://bootstap.github.io/
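The student-teacher idea in the abstract can be made concrete with a small sketch. The following is an illustration of the consistency objective, not the authors' released code: a teacher produces pseudo-label tracks on unlabeled video, the student sees an affine-augmented copy of the same video, and the student's tracks are penalized against the affine-transformed teacher tracks, weighted by teacher confidence. The function names, augmentation ranges, and Huber delta below are assumptions made for the example.

```python
# A minimal sketch of the student-teacher consistency objective described
# above. Illustrative only; names and hyperparameters are assumptions.
import numpy as np

def random_affine(rng):
    """Sample a random affine map (A, b): small rotation, scale, and shift.

    The same map must be applied to the video frames fed to the student,
    so that the teacher's tracks, once transformed, remain valid targets.
    """
    theta = rng.uniform(-0.1, 0.1)
    scale = rng.uniform(0.9, 1.1)
    c, s = np.cos(theta) * scale, np.sin(theta) * scale
    A = np.array([[c, -s], [s, c]])
    b = rng.uniform(-4.0, 4.0, size=2)
    return A, b

def consistency_loss(teacher_xy, teacher_conf, student_xy, A, b):
    """Confidence-weighted Huber loss between the student's tracks on the
    augmented video and the affine-transformed teacher pseudo-labels.

    teacher_xy, student_xy: (N, T, 2) point tracks; teacher_conf: (N, T)
    weights in [0, 1], so uncertain pseudo-labels contribute less.
    """
    target = teacher_xy @ A.T + b  # teacher tracks mapped into the student's frame
    err = np.linalg.norm(student_xy - target, axis=-1)
    delta = 1.0
    huber = np.where(err < delta, 0.5 * err**2, delta * (err - 0.5 * delta))
    return np.sum(teacher_conf * huber) / (np.sum(teacher_conf) + 1e-8)
```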

Carl Doersch, Pauline Luc, Yi Yang, Dilara Gokay, Skanda Koppula, Ankush Gupta, Joseph Heyward, Ignacio Rocco, Ross Goroshin, João Carreira, Andrew Zisserman • 2024

Related benchmarks

Task           | Dataset                     | Result                           | Rank
---------------|-----------------------------|----------------------------------|-----
Point Tracking | DAVIS TAP-Vid               | Average Jaccard (AJ): 61.4       | 41
Point Tracking | DAVIS                       | AJ: 61.4                         | 38
Point Tracking | TAP-Vid Kinetics            | Overall Accuracy: 86.5           | 37
Point Tracking | TAP-Vid RGB-Stacking (test) | AJ: 72.4                         | 32
Point Tracking | TAP-Vid DAVIS (test)        | AJ: 61.4                         | 31
Point Tracking | TAP-Vid Kinetics (test)     | Average Jaccard (AJ): 54.6       | 30
Point Tracking | TAP-Vid-Kinetics (val)      | Average Displacement Error: 68.4 | 25
Point Tracking | DAVIS TAP-Vid (val)         | AJ: 62.4                         | 19
Point Tracking | RGB-Stacking                | Average Delta: 83                | 13
Point Tracking | AllTracker benchmark suite  | Dav. Average Error: 67.9         | 13

Showing 10 of 23 rows.
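For reference, here is a minimal sketch of the Average Jaccard (AJ) metric reported in several rows above, following the TAP-Vid paper's definition. It is an illustrative reimplementation, not the benchmark's official evaluation code; the function name and array layout are assumptions.

```python
# Illustrative reimplementation of TAP-Vid's Average Jaccard (AJ) metric,
# averaged over the standard pixel thresholds {1, 2, 4, 8, 16}.
import numpy as np

def average_jaccard(gt_xy, gt_vis, pred_xy, pred_vis,
                    thresholds=(1, 2, 4, 8, 16)):
    """gt_xy, pred_xy: (N, T, 2) point locations; gt_vis, pred_vis: (N, T)
    boolean visibility flags.

    A prediction is a true positive when both flags are set and it lies
    within the threshold of the ground truth; visible predictions that
    miss (or whose ground truth is occluded) are false positives, and
    unmatched visible ground-truth points are false negatives.
    """
    dist = np.linalg.norm(pred_xy - gt_xy, axis=-1)
    jaccards = []
    for thr in thresholds:
        within = gt_vis & (dist <= thr)
        tp = np.sum(within & pred_vis)
        fp = np.sum(pred_vis & ~within)
        fn = np.sum(gt_vis & ~(pred_vis & (dist <= thr)))
        jaccards.append(tp / (tp + fp + fn + 1e-8))
    return float(np.mean(jaccards))
```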

Other info

Code
