Track-On2: Enhancing Online Point Tracking with Memory
About
In this paper, we consider the problem of long-term point tracking, which requires consistent identification of points across video frames under significant appearance changes, motion, and occlusion. We target the online setting, that is, tracking points frame by frame, which makes the approach suitable for real-time and streaming applications. We extend our prior model Track-On into Track-On2, a simple and efficient transformer-based model for online long-term tracking. Track-On2 improves both performance and efficiency through architectural refinements, more effective use of memory, and improved synthetic training strategies. Unlike prior approaches that rely on full-sequence access or iterative updates, our model processes frames causally and maintains temporal coherence via a memory mechanism, which is key to handling drift and occlusions without requiring future frames. At inference, we perform coarse patch-level classification followed by refinement. Beyond architecture, we systematically study synthetic training setups and their impact on memory behavior, showing how they shape temporal robustness over long sequences. In comprehensive experiments, Track-On2 achieves state-of-the-art results across five synthetic and real-world benchmarks, surpassing prior online trackers and even strong offline methods that exploit bidirectional context. These results highlight the effectiveness of causal, memory-based architectures trained purely on synthetic data as scalable solutions for real-world point tracking. Project page: https://kuis-ai.github.io/track_on2
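To make the inference description above more concrete, the following is a minimal illustrative sketch of online tracking with a coarse patch-level classification step followed by a refinement step. It is not the Track-On2 implementation. Everything here is an assumption made for illustration: the names `coarse_to_fine` and `OnlineTracker`, the use of cosine similarity, the softmax-weighted refinement over a 3x3 patch neighbourhood, the patch size of 8 pixels, and the simple feature-averaging memory. The actual model uses learned transformer modules for each of these steps.

```python
import math
from collections import deque


def cosine(a, b):
    """Cosine similarity between two feature vectors given as lists."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / (norm + 1e-8)


def coarse_to_fine(query, patch_feats, patch_size=8, temperature=0.1):
    """Locate a query point in one frame.

    patch_feats[r][c] is the feature vector of the patch at grid row r, column c.
    Returns the refined (x, y) pixel estimate and the coarse (row, col) patch.
    """
    rows, cols = len(patch_feats), len(patch_feats[0])
    sims = [[cosine(query, patch_feats[r][c]) for c in range(cols)]
            for r in range(rows)]

    # Coarse step: classify the point into the single most similar patch.
    best_r, best_c = max(
        ((r, c) for r in range(rows) for c in range(cols)),
        key=lambda rc: sims[rc[0]][rc[1]],
    )

    # Refinement step (assumed scheme): softmax-weighted average of the
    # patch centres in the 3x3 neighbourhood around the coarse patch.
    total = x = y = 0.0
    for r in range(max(0, best_r - 1), min(rows, best_r + 2)):
        for c in range(max(0, best_c - 1), min(cols, best_c + 2)):
            w = math.exp(sims[r][c] / temperature)
            total += w
            x += w * (c + 0.5) * patch_size
            y += w * (r + 0.5) * patch_size
    return (x / total, y / total), (best_r, best_c)


class OnlineTracker:
    """Hypothetical causal tracker: sees one frame at a time, never future frames."""

    def __init__(self, query, memory_size=4):
        self.initial = list(query)
        # Bounded memory of features matched in recent frames.
        self.memory = deque(maxlen=memory_size)

    def step(self, patch_feats):
        # Build the current query from the initial feature plus the memory.
        feats = [self.initial, *self.memory]
        query = [sum(v) / len(feats) for v in zip(*feats)]
        position, (r, c) = coarse_to_fine(query, patch_feats)
        # Update the memory using only the frame just processed.
        self.memory.append(patch_feats[r][c])
        return position
```

A tracker built this way is called once per incoming frame, for example `position = tracker.step(frame_patch_features)`, so its cost per frame depends on the memory size rather than on the length of the video. The bounded memory is the design choice that stands in for the paper's memory mechanism here; how features are written to and read from memory in Track-On2 itself is not specified in this abstract.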
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Point Tracking | DAVIS TAP-Vid | Average Jaccard (AJ) | 67 | 52 |
| Point Tracking | TAP-Vid Kinetics | Overall Accuracy | 89.6 | 48 |
| Point Tracking | RoboTAP | AJ | 68.1 | 22 |
| Point Tracking | EgoPoints | Average Displacement X | 61.7 | 10 |
| Point Tracking | Dynamic Replica | Average Displacement Error | 74.5 | 9 |
| Point Tracking | EchoNet 100 videos | Average Delta | 34.9 | 4 |
| Point Tracking | MSK-Bone 20 videos | Average Displacement Error | 32.4 | 4 |
| Point Tracking | MSK-POI 36 videos | Delta Avg | 47 | 4 |
| Point Tracking | PointOdyssey | Average Displacement Error (ADE) | 45.1 | 4 |