Global-Local Temporal Representations For Video Person Re-Identification
About
This paper proposes the Global-Local Temporal Representation (GLTR) to exploit multi-scale temporal cues in video sequences for video person Re-Identification (ReID). GLTR is constructed by first modeling short-term temporal cues among adjacent frames, then capturing long-term relations among inconsecutive frames. Specifically, the short-term temporal cues are modeled by parallel dilated convolutions with different temporal dilation rates to represent the motion and appearance of pedestrians. The long-term relations are captured by a temporal self-attention model to alleviate occlusions and noise in video sequences. The short- and long-term temporal cues are aggregated into the final GLTR by a simple single-stream CNN. GLTR shows substantial superiority over existing features learned with body-part cues or metric learning on four widely used video ReID datasets. For instance, it achieves a Rank-1 accuracy of 87.02% on the MARS dataset without re-ranking, surpassing the current state of the art.
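The two components described above can be sketched in PyTorch. This is a minimal illustration, not the authors' released code: layer widths, dilation rates, and the attention head size are assumptions chosen for clarity, and the per-frame features are taken as the output of an arbitrary image CNN.

```python
import torch
import torch.nn as nn


class DilatedTemporalPyramid(nn.Module):
    """Short-term cues: parallel temporal convolutions with different
    dilation rates, applied over the frame axis (a sketch; the exact
    configuration in the paper may differ)."""

    def __init__(self, dim, rates=(1, 2, 3)):
        super().__init__()
        # padding = dilation keeps the temporal length unchanged for k=3.
        self.branches = nn.ModuleList(
            nn.Conv1d(dim, dim, kernel_size=3, dilation=r, padding=r)
            for r in rates
        )

    def forward(self, x):  # x: (batch, dim, frames)
        # Concatenate multi-scale short-term features along channels.
        return torch.cat([branch(x) for branch in self.branches], dim=1)


class TemporalSelfAttention(nn.Module):
    """Long-term relations: self-attention across all frames, which can
    down-weight occluded or noisy frames."""

    def __init__(self, dim):
        super().__init__()
        self.query = nn.Conv1d(dim, dim // 4, kernel_size=1)
        self.key = nn.Conv1d(dim, dim // 4, kernel_size=1)
        self.value = nn.Conv1d(dim, dim, kernel_size=1)

    def forward(self, x):  # x: (batch, dim, frames)
        q, k, v = self.query(x), self.key(x), self.value(x)
        # Frame-to-frame affinity matrix: (batch, frames, frames).
        attn = torch.softmax(q.transpose(1, 2) @ k, dim=-1)
        out = v @ attn.transpose(1, 2)
        return x + out  # residual connection


class GLTR(nn.Module):
    """Aggregates short- and long-term cues into one clip-level vector."""

    def __init__(self, dim=2048, rates=(1, 2, 3)):
        super().__init__()
        self.dtp = DilatedTemporalPyramid(dim, rates)
        self.tsa = TemporalSelfAttention(dim * len(rates))

    def forward(self, frame_feats):  # (batch, dim, frames) per-frame CNN features
        x = self.tsa(self.dtp(frame_feats))
        return x.mean(dim=2)  # temporal average pooling -> GLTR descriptor
```

For a clip of 8 frames with 2048-dim per-frame features, `GLTR()(torch.randn(1, 2048, 8))` yields a single `(1, 6144)` descriptor that can be compared with a distance metric at retrieval time.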
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Video Person Re-ID | MARS | Rank-1 Acc: 87.02% | 106 |
| Video Person Re-ID | iLIDS-VID | Rank-1: 86% | 80 |
| Person Re-Identification | PRID 2011 (test) | Rank-1: 95.5% | 48 |
| Video Person Re-Identification | MARS (test) | Rank-1: 87% | 35 |
| Video Person Re-Identification | DukeMTMC-VideoReID | Rank-1 Accuracy: 96.3% | 26 |
| Video Person Re-Identification | iLIDS-VID (test) | Rank-1: 86% | 25 |
| Video Person Re-Identification | G2A-VReID Ground to Aerial | mAP: 50.1% | 25 |
| Video Person Re-Identification | PRID 2011 | Rank-1 Accuracy: 95.5% | 23 |
| Video Person Re-Identification | MARS v1 (test) | mAP: 85.8% | 21 |
| Video Person Re-Identification | Market-1501 v1 (test) | Rank-1: 87% | 21 |