TAPIR: Tracking Any Point with per-frame Initialization and temporal Refinement
About
We present a novel model for Tracking Any Point (TAP) that effectively tracks any queried point on any physical surface throughout a video sequence. Our approach employs two stages: (1) a matching stage, which independently locates a suitable candidate point match for the query point on every other frame, and (2) a refinement stage, which updates both the trajectory and query features based on local correlations. The resulting model surpasses all baseline methods by a significant margin on the TAP-Vid benchmark, as demonstrated by an approximate 20% absolute average Jaccard (AJ) improvement on DAVIS. Our model facilitates fast inference on long and high-resolution video sequences. On a modern GPU, our implementation has the capacity to track points faster than real-time, and can be flexibly extended to higher-resolution videos. Given the high-quality trajectories extracted from a large dataset, we demonstrate a proof-of-concept diffusion model which generates trajectories from static images, enabling plausible animations. Visualizations, source code, and pretrained models can be found on our project webpage.
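The two-stage design described above (coarse per-frame matching, then local correlation-based refinement) can be illustrated with a toy NumPy sketch. This is only a conceptual stand-in under simplified assumptions: the function names are illustrative, and the refinement here is a plain local-argmax update, not the paper's learned refinement network.

```python
import numpy as np

def match_stage(query_feat, frame_feats):
    """Stage 1: for each frame independently, pick the position whose
    feature best correlates with the query feature (coarse init).
    frame_feats: (T, H, W, C), query_feat: (C,). Returns (T, 2) (row, col)."""
    T, H, W, C = frame_feats.shape
    cost = frame_feats.reshape(T, H * W, C) @ query_feat  # (T, H*W)
    idx = cost.argmax(axis=1)
    return np.stack([idx // W, idx % W], axis=1).astype(float)

def refine_stage(track, frame_feats, query_feat, num_iters=4, radius=2, lr=0.5):
    """Stage 2 (simplified): iteratively nudge each frame's estimate toward
    the best-matching position inside a small local correlation window."""
    T, H, W, C = frame_feats.shape
    track = track.copy()
    for _ in range(num_iters):
        for t in range(T):
            r, c = int(round(track[t, 0])), int(round(track[t, 1]))
            r0, r1 = max(0, r - radius), min(H, r + radius + 1)
            c0, c1 = max(0, c - radius), min(W, c + radius + 1)
            corr = frame_feats[t, r0:r1, c0:c1] @ query_feat  # local cost map
            br, bc = np.unravel_index(corr.argmax(), corr.shape)
            track[t] += lr * (np.array([r0 + br, c0 + bc], float) - track[t])
    return track
```

In the real model, the query features themselves are also updated during refinement, and both stages run over learned (not raw) feature maps; this sketch only shows the control flow of the matching-then-refinement pipeline.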
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Point Tracking | DAVIS TAP-Vid | Average Jaccard (AJ) | 62.9 | 41 |
| Point Tracking | DAVIS | AJ | 56.2 | 38 |
| Point Tracking | TAP-Vid Kinetics | Overall Accuracy | 87.6 | 37 |
| Point Tracking | TAP-Vid RGB-Stacking (test) | AJ | 54.2 | 32 |
| Point Tracking | TAP-Vid DAVIS (test) | AJ | 55.3 | 31 |
| Point Tracking | TAP-Vid Kinetics (test) | Average Jaccard (AJ) | 49.6 | 30 |
| Point Tracking | TAP-Vid-Kinetics (val) | Average Displacement Error | 64.2 | 25 |
| Point Tracking | Kinetics | delta_avg | 64.2 | 24 |
| Point Tracking | DAVIS TAP-Vid (val) | AJ | 61.3 | 19 |
| Point Tracking | TAP-Vid DAVIS (First) | Delta Avg | 70 | 19 |