TAPIR: Tracking Any Point with per-frame Initialization and temporal Refinement
About
We present a novel model for Tracking Any Point (TAP) that effectively tracks any queried point on any physical surface throughout a video sequence. Our approach employs two stages: (1) a matching stage, which independently locates a suitable candidate point match for the query point on every other frame, and (2) a refinement stage, which updates both the trajectory and query features based on local correlations. The resulting model surpasses all baseline methods by a significant margin on the TAP-Vid benchmark, as demonstrated by an approximate 20% absolute average Jaccard (AJ) improvement on DAVIS. Our model facilitates fast inference on long and high-resolution video sequences. On a modern GPU, our implementation has the capacity to track points faster than real-time, and can be flexibly extended to higher-resolution videos. Given the high-quality trajectories extracted from a large dataset, we demonstrate a proof-of-concept diffusion model which generates trajectories from static images, enabling plausible animations. Visualizations, source code, and pretrained models can be found on our project webpage.
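The two-stage design described above (coarse per-frame matching, then local correlation-based refinement) can be illustrated with a toy NumPy sketch. This is only a conceptual stand-in under simplified assumptions: the function names are illustrative, and the refinement here is a plain local-argmax update, not the paper's learned refinement network.

```python
import numpy as np

def match_stage(query_feat, frame_feats):
    """Stage 1: for each frame independently, pick the position whose
    feature best correlates with the query feature (coarse init).
    frame_feats: (T, H, W, C), query_feat: (C,). Returns (T, 2) (row, col)."""
    T, H, W, C = frame_feats.shape
    cost = frame_feats.reshape(T, H * W, C) @ query_feat  # (T, H*W)
    idx = cost.argmax(axis=1)
    return np.stack([idx // W, idx % W], axis=1).astype(float)

def refine_stage(track, frame_feats, query_feat, num_iters=4, radius=2, lr=0.5):
    """Stage 2 (simplified): iteratively nudge each frame's estimate toward
    the best-matching position inside a small local correlation window."""
    T, H, W, C = frame_feats.shape
    track = track.copy()
    for _ in range(num_iters):
        for t in range(T):
            r, c = int(round(track[t, 0])), int(round(track[t, 1]))
            r0, r1 = max(0, r - radius), min(H, r + radius + 1)
            c0, c1 = max(0, c - radius), min(W, c + radius + 1)
            corr = frame_feats[t, r0:r1, c0:c1] @ query_feat  # local cost map
            br, bc = np.unravel_index(corr.argmax(), corr.shape)
            track[t] += lr * (np.array([r0 + br, c0 + bc], float) - track[t])
    return track
```

In the real model, the query features themselves are also updated during refinement, and both stages run over learned (not raw) feature maps; this sketch only shows the control flow of the matching-then-refinement pipeline.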
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Point Tracking | DAVIS TAP-Vid | Average Jaccard (AJ) | 62.9 | 41 |
| Point Tracking | DAVIS | AJ | 56.2 | 38 |
| Point Tracking | TAP-Vid Kinetics | Overall Accuracy | 87.6 | 37 |
| Point Tracking | TAP-Vid RGB-Stacking (test) | AJ | 54.2 | 32 |
| Point Tracking | TAP-Vid DAVIS (test) | AJ | 55.3 | 31 |
| Point Tracking | TAP-Vid Kinetics (test) | Average Jaccard (AJ) | 49.6 | 30 |
| Point Tracking | TAP-Vid-Kinetics (val) | Average Displacement Error | 64.2 | 25 |
| Point Tracking | Kinetics | delta_avg | 64.2 | 24 |
| Point Tracking | DAVIS TAP-Vid (val) | AJ | 61.3 | 19 |
| Point Tracking | TAP-Vid DAVIS (First) | Delta Avg | 70 | 19 |