Self-Supervised Any-Point Tracking by Contrastive Random Walks

About

We present a simple, self-supervised approach to the Tracking Any Point (TAP) problem. We train a global matching transformer to find cycle-consistent tracks through video via contrastive random walks, using the transformer's attention-based global matching to define the transition matrices for a random walk on a space-time graph. The ability to perform "all pairs" comparisons between points allows the model to obtain high spatial precision and a strong contrastive learning signal, while avoiding many of the complexities of recent approaches (such as coarse-to-fine matching). To do this, we propose a number of design decisions that allow global matching architectures to be trained through self-supervision using cycle consistency. For example, we identify that transformer-based methods are sensitive to shortcut solutions, and propose a data augmentation scheme to address them. Our method achieves strong performance on the TAP-Vid benchmarks, outperforming previous self-supervised tracking methods, such as DIFT, and is competitive with several supervised methods.
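
The cycle-consistency objective described in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: random per-point features stand in for the output of the global matching transformer, a softmax over all-pairs cosine similarity stands in for its attention-based matching, and the names `transition`, `cycle_walk`, `cycle_loss`, and the `temperature` value are hypothetical choices made for illustration only.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def transition(feat_a, feat_b, temperature=0.1):
    """All-pairs matching between the points of two frames.

    Returns a row-stochastic matrix: row i is the probability that a walker
    at point i in frame a steps to each point in frame b.
    """
    a = feat_a / np.linalg.norm(feat_a, axis=1, keepdims=True)
    b = feat_b / np.linalg.norm(feat_b, axis=1, keepdims=True)
    return softmax(a @ b.T / temperature, axis=1)

def cycle_walk(feats, temperature=0.1):
    """Chain transitions forward through the frames and back to the first one."""
    # Palindrome of frames: 0, 1, ..., T-1, ..., 1, 0
    palindrome = list(feats) + list(feats[-2::-1])
    walk = np.eye(feats[0].shape[0])
    for a, b in zip(palindrome[:-1], palindrome[1:]):
        walk = walk @ transition(a, b, temperature)
    return walk

def cycle_loss(walk, eps=1e-12):
    """Cross-entropy between the round-trip distribution and the identity:
    each walker should return to the point it started from."""
    return float(-np.mean(np.log(np.diag(walk) + eps)))

# Assumed toy input: 4 frames, 16 points per frame, 32-dim features.
rng = np.random.default_rng(0)
feats = [rng.normal(size=(16, 32)) for _ in range(4)]
walk = cycle_walk(feats)
loss = cycle_loss(walk)
print(walk.shape, loss)
```

In training, a loss of this kind would be minimized with respect to the feature extractor, so that the only supervision is the requirement that walks return to their starting points. The sketch omits the data augmentation the abstract mentions for avoiding shortcut solutions.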

Ayush Shrivastava, Andrew Owens • 2024

Related benchmarks

Task	Dataset	Metric	Result	
Point Tracking	TAP-Vid DAVIS (First)	Delta Avg (<c)	54.59	76
Point Tracking	TAP-Vid Kinetics (First)	Avg Displacement Error (delta_avg)	41.63	53
Point Tracking	DAVIS TAP-Vid	Average Jaccard (AJ)	36.47	52
Point Tracking	TAP-Vid Kinetics	Overall Accuracy	71.33	48
Point Tracking	DAVIS	AJ	30.3	38
Point Tracking	Kinetics	delta_avg	52.3	24
Point Tracking	Kubric	AJ	54.2	18
Point Tracking	TAP-Vid Kubric (subset of 30 videos)	AJ	55.04	12
