Video Anomaly Detection by Solving Decoupled Spatio-Temporal Jigsaw Puzzles
About
Video Anomaly Detection (VAD) is an important topic in computer vision. Motivated by recent advances in self-supervised learning, this paper addresses VAD by solving an intuitive yet challenging pretext task, namely spatio-temporal jigsaw puzzles, which is cast as a multi-label fine-grained classification problem. Our method has several advantages over existing works: 1) the spatio-temporal jigsaw puzzles are decoupled into spatial and temporal dimensions, which are responsible for capturing highly discriminative appearance and motion features, respectively; 2) full permutations are used to provide abundant jigsaw puzzles covering various difficulty levels, allowing the network to distinguish subtle spatio-temporal differences between normal and abnormal events; and 3) the pretext task is tackled in an end-to-end manner without relying on any pre-trained models. Our method outperforms state-of-the-art counterparts on three public benchmarks. In particular, on ShanghaiTech Campus it surpasses reconstruction-based and prediction-based methods by a large margin.
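The decoupled puzzle construction described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `shuffle_frames` and `shuffle_patches`, the list-based frame representation, and the grid size are all assumptions made for illustration. It reads "multi-label classification" as predicting one class per position (the original index of the frame or patch placed there), with the permutation drawn uniformly from all possible orderings rather than from a fixed subset.

```python
import random

def shuffle_frames(frames, rng):
    """Temporal puzzle (hypothetical helper): permute the frame order of a clip.

    Returns the shuffled frames and one label per position: labels[i] is the
    original index of the frame now sitting at position i.
    """
    labels = list(range(len(frames)))
    rng.shuffle(labels)  # uniform draw from all len(frames)! permutations
    return [frames[j] for j in labels], labels

def shuffle_patches(frame, grid, rng):
    """Spatial puzzle (hypothetical helper): permute grid x grid patches.

    `frame` is a list of rows; its height and width must be divisible by
    `grid`. labels[i] is the original index (row-major) of the patch now
    placed at position i.
    """
    ph, pw = len(frame) // grid, len(frame[0]) // grid
    # Cut the frame into patches in row-major order.
    patches = [
        [row[c * pw:(c + 1) * pw] for row in frame[r * ph:(r + 1) * ph]]
        for r in range(grid) for c in range(grid)
    ]
    labels = list(range(grid * grid))
    rng.shuffle(labels)
    # Reassemble the permuted patches into a frame of the original size.
    shuffled = []
    for r in range(grid):
        for y in range(ph):
            row = []
            for c in range(grid):
                row.extend(patches[labels[r * grid + c]][y])
            shuffled.append(row)
    return shuffled, labels

if __name__ == "__main__":
    rng = random.Random(0)
    clip = ["f0", "f1", "f2", "f3"]           # stand-ins for video frames
    print(shuffle_frames(clip, rng))
    frame = [[r * 4 + c for c in range(4)] for r in range(4)]
    print(shuffle_patches(frame, 2, rng))
```

In a setup like this, a network would receive either a temporally shuffled clip or a spatially shuffled frame and predict the per-position labels. The prediction confidence on test data could then serve as a normality score, although the exact scoring rule is not stated in the abstract.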
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Video Anomaly Detection | CUHK Avenue (Ave) (test) | AUC 92.2 | 203 |
| Video Anomaly Detection | ShanghaiTech (test) | AUC 0.843 | 194 |
| Abnormal Event Detection | UCSD Ped2 (test) | AUC 99 | 146 |
| Abnormal Event Detection | UCSD Ped2 | -- | 132 |
| Video Anomaly Detection | Avenue (test) | AUC (Micro) 92.2 | 85 |
| Video Anomaly Detection | CUHK Avenue | Frame AUC 91.41 | 65 |
| Video Anomaly Detection | ShanghaiTech | Micro AUC 0.843 | 51 |
| Video Anomaly Detection | ShanghaiTech standard (test) | Frame-Level AUC 84.2 | 50 |
| Video Anomaly Detection | UBnormal (test) | AUC 56.4 | 37 |
| Action Recognition | UCF-101 fine-tuning protocol | Accuracy 67.7 | 35 |