Long-term Temporal Convolutions for Action Recognition
About
Typical human actions last several seconds and exhibit characteristic spatio-temporal structure. Recent methods attempt to capture this structure and learn action representations with convolutional neural networks. Such representations, however, are typically learned at the level of a few video frames and so fail to model actions at their full temporal extent. In this work we learn video representations using neural networks with long-term temporal convolutions (LTC). We demonstrate that LTC-CNN models with increased temporal extents improve the accuracy of action recognition. We also study the impact of different low-level representations, such as raw values of video pixels and optical flow vector fields, and demonstrate the importance of high-quality optical flow estimation for learning accurate action models. We report state-of-the-art results on two challenging benchmarks for human action recognition: UCF101 (92.7%) and HMDB51 (67.2%).
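To make the notion of "temporal extent" concrete, the following minimal sketch computes how many input frames a single output unit can see after a stack of temporal convolution and pooling layers, and how many seconds a clip of a given length covers. The layer stack (five blocks of a size-3 temporal convolution followed by size-2, stride-2 temporal pooling), the clip length, and the frame rate are illustrative assumptions, not the configuration reported by the authors.

```python
def temporal_receptive_field(layers):
    """Return the receptive field (in frames) along the time axis.

    `layers` is a list of (kernel_size, stride) pairs, applied in order.
    Uses the standard recurrence: each layer widens the field by
    (kernel_size - 1) times the current spacing between output units.
    """
    receptive_field = 1
    jump = 1  # spacing, in input frames, between neighbouring units
    for kernel_size, stride in layers:
        receptive_field += (kernel_size - 1) * jump
        jump *= stride
    return receptive_field


def clip_seconds(num_frames, fps):
    """Duration in seconds covered by a clip of `num_frames` frames."""
    return num_frames / fps


# Hypothetical stack: 5 x [temporal conv (k=3, s=1), temporal pool (k=2, s=2)].
hypothetical_layers = [(3, 1), (2, 2)] * 5

if __name__ == "__main__":
    rf = temporal_receptive_field(hypothetical_layers)
    print("receptive field (frames):", rf)
    # Assumed clip length and frame rate, for illustration only.
    print("clip duration (s):", clip_seconds(100, 25.0))
```

Under these assumptions the receptive field of the stack is far larger than a clip of only a few frames, so the input clip length, rather than the network depth, limits how much of an action the model can observe.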
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Action Recognition | UCF101 | Accuracy | 91.7 | 365 |
| Action Recognition | UCF101 (mean of 3 splits) | Accuracy | 92.7 | 357 |
| Action Recognition | UCF101 (test) | Accuracy | 91.7 | 307 |
| Action Recognition | HMDB51 (test) | Accuracy | 0.648 | 249 |
| Action Recognition | HMDB-51 (average of three splits) | Top-1 Acc | 48.7 | 204 |
| Action Recognition | HMDB51 | 3-Fold Accuracy | 67.2 | 191 |
| Action Recognition | UCF101 (3 splits) | Accuracy | 91.7 | 155 |
| Action Classification | HMDB51 (over all three splits) | Accuracy | 64.8 | 121 |
| Video Action Recognition | HMDB-51 (3 splits) | Accuracy | 64.8 | 116 |
| Video Action Recognition | HMDB51 (avg over all splits) | Top-1 Acc | 64.8 | 56 |