Timeception for Complex Action Recognition
About
This paper focuses on the temporal aspect of recognizing human activities in videos, an important visual cue that has long been undervalued. We revisit the conventional definition of activity and restrict it to Complex Action: a set of one-actions with a weak temporal pattern that serves a specific purpose. Related works use spatiotemporal 3D convolutions with a fixed kernel size, which are too rigid to capture the variety in temporal extents of complex actions and too short for long-range temporal modeling. In contrast, we use multi-scale temporal convolutions, and we reduce the complexity of 3D convolutions. The outcome is Timeception convolution layers, which reason about minute-long temporal patterns, a factor of 8 longer than the best related works. As a result, Timeception achieves impressive accuracy in recognizing the human activities of Charades, Breakfast Actions, and MultiTHUMOS. Further, we demonstrate that Timeception learns long-range temporal dependencies and tolerates the temporal extents of complex actions.
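To make the idea of multi-scale temporal convolutions concrete, the following is a minimal NumPy sketch and not the paper's Timeception layer. It assumes a feature sequence of shape (timesteps, channels), applies a per-channel 1D convolution along time at several kernel sizes, and concatenates the branches. The function names, the kernel sizes (3, 5, 7), and the averaging kernels that stand in for learned weights are all assumptions made for illustration.

```python
import numpy as np


def temporal_conv_depthwise(x, kernel):
    """Convolve each channel of x (shape T x C) along time, keeping length T."""
    return np.stack(
        [np.convolve(x[:, c], kernel, mode="same") for c in range(x.shape[1])],
        axis=1,
    )


def multi_scale_temporal(x, kernel_sizes=(3, 5, 7)):
    """Run one temporal branch per kernel size and concatenate along channels."""
    branches = []
    for k in kernel_sizes:
        # Averaging kernel: a hypothetical stand-in for learned weights.
        kernel = np.ones(k) / k
        branches.append(temporal_conv_depthwise(x, kernel))
    return np.concatenate(branches, axis=1)


# Toy input: 64 timesteps, 4 channels.
x = np.random.default_rng(0).standard_normal((64, 4))
y = multi_scale_temporal(x)
print(y.shape)  # (64, 12): three branches of 4 channels each
```

Because each branch convolves only along time and per channel, its cost grows with the kernel length rather than with a full 3D kernel volume, which is the intuition behind using temporal-only kernels at several scales.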
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Action Recognition | Charades | mAP 0.411 | 64 |
| Action Recognition | Charades (test) | mAP 0.411 | 53 |
| Action Recognition | Charades v1 (test) | -- | 52 |
| Action Recognition | Breakfast | Top-1 Accuracy 71.3 | 28 |
| Single-label activity classification | Breakfast | Accuracy 71.3 | 21 |
| Action Recognition | Charades v1 (val) | mAP 41.1 | 15 |
| Human Activity Recognition | Breakfast | Accuracy 71.3 | 14 |
| Long-form Video Classification | Breakfast | Top-1 Accuracy 71.3 | 14 |
| Action Recognition | Breakfast (1357:335) | Accuracy 86.9 | 13 |
| Video Understanding | Breakfast | Top-1 Acc 71.3 | 12 |