STM: SpatioTemporal and Motion Encoding for Action Recognition
About
Spatiotemporal and motion features are two complementary and crucial types of information for video action recognition. Recent state-of-the-art methods adopt a 3D CNN stream to learn spatiotemporal features and a separate flow stream to learn motion features. In this work, we aim to efficiently encode both kinds of features in a unified 2D framework. To this end, we first propose an STM block, which contains a Channel-wise SpatioTemporal Module (CSTM) to represent the spatiotemporal features and a Channel-wise Motion Module (CMM) to efficiently encode motion features. We then replace the original residual blocks in the ResNet architecture with STM blocks to form a simple yet effective STM network that introduces very limited extra computation cost. Extensive experiments demonstrate that, by encoding spatiotemporal and motion features together, the proposed STM network outperforms state-of-the-art methods on both temporal-related datasets (i.e., Something-Something v1 & v2 and Jester) and scene-related datasets (i.e., Kinetics-400, UCF-101, and HMDB-51).
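The two-branch idea behind the STM block can be illustrated with a small NumPy sketch. This is not the authors' implementation: the abstract does not spell out the internals of CSTM or CMM, so the sketch assumes that the spatiotemporal branch can be approximated by a per-channel temporal filter of size 3 and the motion branch by differences between adjacent frames' features, fused by summation and added back through a residual connection. The function names (`channelwise_temporal_conv`, `adjacent_frame_difference`, `stm_like_block`) are hypothetical, and the 2D spatial convolutions, channel reduction, and normalization layers that a real network would need are omitted.

```python
import numpy as np


def channelwise_temporal_conv(x, weights):
    """Per-channel temporal filter of size 3 (assumed stand-in for CSTM).

    x: features of shape (T, C, H, W); weights: shape (C, 3).
    The time axis is zero-padded so the output keeps T frames.
    """
    t = x.shape[0]
    padded = np.pad(x, ((1, 1), (0, 0), (0, 0), (0, 0)))
    w = weights[:, :, None, None]  # (C, 3, 1, 1), broadcasts over H and W
    return (padded[0:t] * w[:, 0]
            + padded[1:t + 1] * w[:, 1]
            + padded[2:t + 2] * w[:, 2])


def adjacent_frame_difference(x):
    """Feature difference x[t+1] - x[t] (assumed stand-in for CMM).

    The last frame has no successor, so its motion features are zero.
    """
    diff = np.zeros_like(x)
    diff[:-1] = x[1:] - x[:-1]
    return diff


def stm_like_block(x, temporal_weights):
    """Sum both branches and add the residual input (simplified sketch)."""
    spatiotemporal = channelwise_temporal_conv(x, temporal_weights)
    motion = adjacent_frame_difference(x)
    return x + spatiotemporal + motion


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    clip = rng.standard_normal((8, 16, 7, 7))  # 8 frames, 16 channels
    kernels = rng.standard_normal((16, 3))     # one temporal kernel per channel
    print(stm_like_block(clip, kernels).shape)
```

Because both branches operate per channel and only along the time axis, neither requires 3D convolutions or precomputed optical flow, which is the sense in which a 2D backbone can carry both kinds of information in this simplified picture.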
Related benchmarks
| Task | Dataset | Metric | Result (%) | Rank |
|---|---|---|---|---|
| Action Recognition | Something-Something v2 (val) | Top-1 Accuracy | 64.3 | 535 |
| Action Recognition | Kinetics-400 | Top-1 Acc | 73.7 | 413 |
| Action Recognition | UCF101 | Accuracy | 96.2 | 365 |
| Action Recognition | UCF101 (mean of 3 splits) | Accuracy | 96.2 | 357 |
| Action Recognition | Something-Something v2 | Top-1 Accuracy | 64.2 | 341 |
| Action Recognition | Something-Something v2 (test) | Top-1 Acc | 64.2 | 333 |
| Action Recognition | Something-Something v1 (val) | Top-1 Acc | 50.7 | 257 |
| Action Recognition | Kinetics 400 (test) | Top-1 Accuracy | 73.7 | 245 |
| Action Recognition | HMDB51 | Top-1 Acc | 72.2 | 225 |
| Action Recognition | HMDB-51 (average of three splits) | Top-1 Acc | 72.2 | 204 |