ODTrack: Online Dense Temporal Token Learning for Visual Tracking
About
Online contextual reasoning and association across consecutive video frames are critical for perceiving instances in visual tracking. However, most current top-performing trackers rely on sparse temporal relationships between reference and search frames, modeled in an offline mode. Consequently, they can only interact independently within each image pair and establish limited temporal correlations. To alleviate this problem, we propose a simple, flexible, and effective video-level tracking pipeline, named ODTrack, which densely associates the contextual relationships of video frames through online token propagation. ODTrack receives video frames of arbitrary length to capture the spatio-temporal trajectory relationships of an instance, and compresses the discriminative features (localization information) of a target into a token sequence to achieve frame-to-frame association. This new solution brings the following benefits: 1) the purified token sequences can serve as prompts for inference in the next video frame, whereby past information is leveraged to guide future inference; 2) complex online update strategies are avoided by the iterative propagation of token sequences, which yields more efficient model representation and computation. ODTrack achieves new state-of-the-art (SOTA) performance on seven benchmarks while running at real-time speed. Code and models are available at https://github.com/GXNU-ZhongLab/ODTrack.
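The token propagation idea described above can be illustrated with a toy loop. This is only a minimal sketch under stated assumptions, not the authors' implementation: the real model is a learned network, whereas here `toy_frame_step`, `track_video`, and the `momentum` blending rule are hypothetical stand-ins, and each "frame" is just a small feature vector. The point it shows is the control flow: the token produced for one frame is passed as a prompt to the next frame, so no separate online update step is needed.

```python
from typing import List, Sequence, Tuple

Vector = List[float]


def toy_frame_step(frame_feat: Sequence[float],
                   token: Sequence[float],
                   momentum: float = 0.5) -> Tuple[float, Vector]:
    """Hypothetical stand-in for one tracker forward pass.

    Returns a matching score for the current frame and an updated token.
    A real tracker would use a learned network here; this blend is an
    assumption made purely for illustration.
    """
    # Score the current frame against the token carried over from the past.
    score = sum(t * f for t, f in zip(token, frame_feat))
    # Compress the current frame's information into the propagated token.
    new_token = [momentum * t + (1.0 - momentum) * f
                 for t, f in zip(token, frame_feat)]
    return score, new_token


def track_video(frames: Sequence[Sequence[float]],
                init_token: Sequence[float]) -> Tuple[List[float], Vector]:
    """Process frames in order, feeding each frame's token to the next."""
    token = list(init_token)
    scores: List[float] = []
    for frame_feat in frames:
        score, token = toy_frame_step(frame_feat, token)
        scores.append(score)
    return scores, token


if __name__ == "__main__":
    frames = [[1.0, 0.0], [1.0, 0.0]]
    scores, final_token = track_video(frames, init_token=[0.0, 0.0])
    print(scores, final_token)
```

Because the token is the only state carried between frames, the loop handles videos of any length with constant memory, which is the property the abstract attributes to iterative token propagation.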
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Visual Object Tracking | TrackingNet (test) | Normalized Precision (Pnorm) | 91 | 460 |
| Visual Object Tracking | LaSOT (test) | AUC | 74 | 444 |
| Visual Object Tracking | GOT-10k (test) | Average Overlap | 78.2 | 378 |
| Object Tracking | LaSoT | AUC | 74 | 333 |
| Object Tracking | TrackingNet | Precision (P) | 86.7 | 225 |
| Visual Object Tracking | GOT-10k | AO | 78.2 | 223 |
| Visual Object Tracking | VOT 2020 (test) | EAO | 0.605 | 147 |
| Visual Object Tracking | OTB-100 | AUC | 72.4 | 136 |
| Visual Object Tracking | TNL2K | AUC | 61.7 | 95 |
| Visual Object Tracking | LaSoText | Precision | 61.7 | 88 |