Motion Feature Network: Fixed Motion Filter for Action Recognition
About
Spatio-temporal representations in frame sequences play an important role in action recognition. Previously, combining optical flow as temporal information with a set of RGB images that contain spatial information has substantially improved performance on action recognition tasks. However, this approach is computationally expensive and requires a two-stream (RGB and optical flow) framework. In this paper, we propose MFNet (Motion Feature Network), which contains motion blocks that encode spatio-temporal information between adjacent frames in a unified network that can be trained end-to-end. The motion block can be attached to any existing CNN-based action recognition framework at only a small additional cost. We evaluated our network on two action recognition datasets (Jester and Something-Something) and achieved competitive performance on both by training the networks from scratch.
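To make the idea of encoding information "between adjacent frames" with a fixed filter more concrete, the following is a minimal sketch and not the paper's implementation. It assumes that a motion feature can be approximated by subtracting a spatially shifted copy of the next frame's feature map from the current one, for a small fixed set of shifts. The function names `shift` and `motion_maps`, the zero padding, and the shift set are all hypothetical choices made for illustration.

```python
def shift(frame, dx, dy):
    """Return a map where out[y][x] = frame[y + dy][x + dx], zero-padded at the borders."""
    h, w = len(frame), len(frame[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            sy, sx = y + dy, x + dx
            if 0 <= sy < h and 0 <= sx < w:
                out[y][x] = frame[sy][sx]
    return out


def motion_maps(frame_t, frame_next, shifts):
    """For each fixed (dx, dy) shift, subtract the shifted next frame from the current frame.

    This is an assumed, simplified stand-in for a motion filter: a near-zero
    difference map suggests the content moved by roughly that shift.
    """
    maps = []
    for dx, dy in shifts:
        shifted = shift(frame_next, dx, dy)
        diff = [[a - b for a, b in zip(row_t, row_s)]
                for row_t, row_s in zip(frame_t, shifted)]
        maps.append(diff)
    return maps


# Toy 3x3 single-channel "feature maps": one active cell moves one step to the right.
frame_t = [[0.0, 0.0, 0.0],
           [0.0, 1.0, 0.0],
           [0.0, 0.0, 0.0]]
frame_next = [[0.0, 0.0, 0.0],
              [0.0, 0.0, 1.0],
              [0.0, 0.0, 0.0]]

# Hypothetical fixed shift set: no motion, and one step to the right.
no_motion, moved_right = motion_maps(frame_t, frame_next, [(0, 0), (1, 0)])
```

In this toy case the difference map for the shift `(1, 0)` is zero everywhere, because that shift matches the actual displacement, while the map for `(0, 0)` is non-zero at the old and new positions. Since the shifts are fixed rather than learned, a filter of this kind adds no trainable parameters of its own.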
Related benchmarks
| Task | Dataset | Metric | Value (%) | Rank |
|---|---|---|---|---|
| Action Recognition | Something-something v1 (val) | Top-1 Acc | 43.9 | 257 |
| Action Recognition | HMDB51 | 3-Fold Accuracy | 56.8 | 191 |
| Action Recognition | Something-something v1 (test) | Top-1 Accuracy | 43.9 | 189 |
| Action Recognition | Something-Something V1 | Top-1 Acc | 43.9 | 162 |
| Video Classification | Something-something v1 (test) | Top-1 Accuracy | 43.9 | 115 |
| Action Recognition | HMDB51 (split 1) | Top-1 Acc | 56.8 | 75 |
| Video Classification | Something-something v1 (val) | Top-1 Acc | 43.9 | 75 |
| Action Recognition | Something-Something V1 (test val) | Top-1 Acc | 43.9 | 48 |
| Action Recognition | Jester (val) | Top-1 Accuracy | 96.68 | 44 |
| Action Recognition | Something-Something (val) | Top-1 Accuracy | 43.92 | 18 |