Spatially and Temporally Efficient Non-local Attention Network for Video-based Person Re-Identification
About
Video-based person re-identification (Re-ID) aims to match video sequences of pedestrians across non-overlapping cameras. A practical yet challenging part of the task is embedding the spatial and temporal information of a video into its feature representation. While most existing methods learn video characteristics by aggregating image-wise features and designing attention mechanisms in neural networks, they only explore the correlation between frames at high-level features. In this work, we aim to refine the intermediate features as well as the high-level features with non-local attention operations, and we make two contributions. (i) We propose a Non-local Video Attention Network (NVAN) to incorporate video characteristics into the representation at multiple feature levels. (ii) We further introduce a Spatially and Temporally Efficient Non-local Video Attention Network (STE-NVAN) to reduce the computational complexity by exploiting the spatial and temporal redundancy present in pedestrian videos. Extensive experiments show that NVAN outperforms the state of the art by 3.8% in rank-1 accuracy on the MARS dataset and confirm that STE-NVAN has a much lower computational footprint than existing methods.
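To make the term "non-local attention operation" concrete, here is a minimal NumPy sketch of a generic non-local block applied to a stack of frame feature maps. It is an illustration under stated assumptions, not the NVAN implementation: the function name, the projection matrices `w_theta`, `w_phi`, `w_g`, `w_out`, and the tensor sizes are all hypothetical, and the softmax (embedded Gaussian) form is assumed rather than taken from the abstract.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def non_local_attention(feats, w_theta, w_phi, w_g, w_out):
    """Generic non-local block (hypothetical helper, not the authors' code).

    feats: array of shape (T, H, W, C), the feature maps of T frames.
    Returns the refined features (same shape) and the attention matrix.
    """
    t, h, w, c = feats.shape
    x = feats.reshape(t * h * w, c)          # each spatio-temporal position is one token
    theta = x @ w_theta                      # queries, shape (N, C')
    phi = x @ w_phi                          # keys,    shape (N, C')
    g = x @ w_g                              # values,  shape (N, C')
    attn = softmax(theta @ phi.T, axis=-1)   # (N, N): every position attends to every other
    y = attn @ g                             # aggregate values across space and time
    out = x + y @ w_out                      # residual connection back to C channels
    return out.reshape(t, h, w, c), attn

# Illustrative sizes only (assumed, not from the paper).
rng = np.random.default_rng(0)
T, H, W, C, C_inner = 4, 8, 4, 16, 8
feats = rng.standard_normal((T, H, W, C))
w_theta, w_phi, w_g = (rng.standard_normal((C, C_inner)) * 0.1 for _ in range(3))
w_out = np.zeros((C_inner, C))  # zero output projection: the block starts as an identity map

out, attn = non_local_attention(feats, w_theta, w_phi, w_g, w_out)
```

The attention matrix has (T·H·W)² entries, so its cost grows quadratically with the number of frames and the spatial resolution. That cost is why redundancy across space and time is worth exploiting when such a block is applied to intermediate feature maps.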
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Video Person Re-ID | MARS | Rank-1 Acc | 88.9 | 106 |
| Person Re-Identification | MARS (test) | Rank-1 | 88.9 | 72 |
| Person Re-Identification | MARS | Rank-1 | 90 | 67 |
| Video Person Re-Identification | DukeMTMC-VideoReID | Rank-1 Accuracy | 95.2 | 26 |
| Video-to-Video Person Re-identification | MARS (test) | Top-1 Accuracy | 90 | 22 |
| Video Person Re-Identification | Market-1501 v1 (test) | Rank-1 | 90 | 21 |
| Video Person Re-Identification | MARS v1 (test) | mAP | 82.3 | 21 |
| Image-to-Video Person Re-identification | DukeMTMC-VideoReID (test) | Top-1 Acc | 95.2 | 16 |
| Video-based Person Re-identification | DukeV | R1 | 96.3 | 15 |
| Video-to-shop retrieval | MultiDeepFashion 2 (test) | T-1 Accuracy | 22 | 13 |
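
The table reports rank-1 accuracy and mAP. As a rough illustration of what these retrieval metrics measure, the sketch below computes both from a query-to-gallery distance matrix. It is a simplified version under assumptions: the function name and toy data are hypothetical, and standard Re-ID evaluation protocols typically also filter gallery entries by camera, which is omitted here.

```python
import numpy as np

def rank1_and_map(dist, query_ids, gallery_ids):
    """Simplified rank-1 accuracy and mean average precision (hypothetical helper).

    dist: (num_query, num_gallery) distances, smaller means more similar.
    """
    order = np.argsort(dist, axis=1)                      # gallery sorted per query
    matches = gallery_ids[order] == query_ids[:, None]    # True where identity matches
    rank1 = matches[:, 0].mean()                          # top result is correct
    aps = []
    for row in matches:
        hits = np.flatnonzero(row)                        # 0-based ranks of correct results
        if hits.size == 0:
            continue                                      # skip queries with no true match
        precision_at_hits = np.arange(1, hits.size + 1) / (hits + 1)
        aps.append(precision_at_hits.mean())
    return rank1, float(np.mean(aps))

# Toy data (made up for illustration): 2 queries, 3 gallery sequences.
query_ids = np.array([0, 1])
gallery_ids = np.array([0, 1, 0])
dist = np.array([[0.1, 0.5, 0.3],
                 [0.2, 0.4, 0.9]])

rank1, mean_ap = rank1_and_map(dist, query_ids, gallery_ids)
print(rank1, mean_ap)  # 0.5 0.75
```

In the toy example the first query retrieves both of its matches first (AP 1.0), while the second finds its only match at rank 2 (AP 0.5), giving a rank-1 accuracy of 0.5 and an mAP of 0.75.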