TAPNext: Tracking Any Point (TAP) as Next Token Prediction
About
Tracking Any Point (TAP) in a video is a challenging computer vision problem with many demonstrated applications in robotics, video editing, and 3D reconstruction. Existing methods for TAP rely heavily on complex tracking-specific inductive biases and heuristics, limiting their generality and potential for scaling. To address these challenges, we present TAPNext, a new approach that casts TAP as sequential masked token decoding. Our model is causal, tracks in a purely online fashion, and removes tracking-specific inductive biases. This enables TAPNext to run with minimal latency and removes the temporal windowing required by many state-of-the-art trackers. Despite its simplicity, TAPNext achieves new state-of-the-art tracking performance among both online and offline trackers. Finally, we present evidence that many widely used tracking heuristics emerge naturally in TAPNext through end-to-end training. The TAPNext model and code can be found at https://tap-next.github.io/.
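To make the framing concrete, the loop below is a minimal, hypothetical sketch of point tracking as online, causal, per-frame token decoding. `CausalTracker` and its `step` method are illustrative stand-ins, not the actual TAPNext architecture: the idea shown is only that the tracker consumes one frame at a time, carries causal state across frames (no temporal window), and decodes a location and visibility prediction for the query point at every step.

```python
import numpy as np

class CausalTracker:
    """Illustrative stand-in for a causal per-frame point tracker.

    A real model would decode learned location/visibility tokens from
    image features; here `step` just echoes the query to show the
    online interface: one frame in, one prediction out.
    """

    def __init__(self):
        self.state = None  # causal context carried across frames

    def step(self, frame, query_xy):
        # Pretend to update the causal context from the current frame only.
        self.state = float(frame.mean())
        x, y = query_xy
        visible = True  # placeholder visibility decoding
        return (x, y), visible


def track_online(frames, query_xy):
    """Purely online tracking: process frames sequentially, never look ahead."""
    tracker = CausalTracker()
    trajectory = []
    for frame in frames:
        xy, vis = tracker.step(frame, query_xy)
        trajectory.append((xy, vis))  # one prediction per frame, minimal latency
    return trajectory


video = [np.zeros((256, 256, 3)) for _ in range(4)]
traj = track_online(video, query_xy=(128.0, 64.0))
print(len(traj))  # one (location, visibility) pair per frame
```

Because the model is causal, each prediction depends only on frames seen so far, which is what lets an approach like this run frame-by-frame without the sliding temporal windows used by many offline trackers.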
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Point Tracking | TAP-Vid DAVIS (First) | δ_avg | 76.6 | 76 |
| Point Tracking | TAP-Vid Kinetics (First) | δ_avg | 64.46 | 53 |
| Point Tracking | TAP-Vid DAVIS | Average Jaccard (AJ) | 65.2 | 52 |
| Point Tracking | TAP-Vid Kinetics | Overall Accuracy | 90.06 | 48 |
| Point Tracking | TAP-Vid DAVIS (Strided) | δ_avg | 79.7 | 33 |
| Point Tracking | RoboTAP | Average Jaccard (AJ) | 59.5 | 22 |
| Point Tracking | TAP-Vid Kubric (subset of 30 videos) | Average Jaccard (AJ) | 80.91 | 12 |
| Point Tracking | EgoPoints | δ_avg | 31.8 | 10 |
| Point Tracking | Dynamic Replica | δ_avg | 46.2 | 9 |
| Point Tracking | RoboTAP (First) | Average Jaccard (AJ) | 59.8 | 8 |