Learning to track for spatio-temporal action localization
About
We propose an effective approach for spatio-temporal action localization in realistic videos. The approach first detects proposals at the frame level and scores them with a combination of static and motion CNN features. It then tracks high-scoring proposals throughout the video using a tracking-by-detection approach. Our tracker relies simultaneously on instance-level and class-level detectors. The tracks are scored using a spatio-temporal motion histogram, a descriptor at the track level, in combination with the CNN features. Finally, we perform temporal localization of the action using a sliding-window approach at the track level. We present experimental results for spatio-temporal localization on the UCF-Sports, J-HMDB and UCF-101 action localization datasets, where our approach outperforms the state of the art by a margin of 15%, 7% and 12% respectively in mAP.
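The final step, sliding-window temporal localization at the track level, can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `best_temporal_window`, the use of the mean per-frame score as the window score, and the candidate window lengths are all assumptions made for illustration. The abstract states that tracks are scored with CNN features combined with a spatio-temporal motion histogram; here each frame of a track is simply assumed to already carry one scalar score.

```python
from typing import Sequence, Tuple


def best_temporal_window(
    frame_scores: Sequence[float],
    window_lengths: Sequence[int],
    stride: int = 1,
) -> Tuple[int, int, float]:
    """Slide windows of several lengths over one track's per-frame scores.

    Returns (start, end, score) for the best window, with `end` exclusive.
    The window score is the mean frame score (an illustrative assumption).
    """
    n = len(frame_scores)

    # Prefix sums make every window sum an O(1) lookup.
    prefix = [0.0]
    for s in frame_scores:
        prefix.append(prefix[-1] + s)

    best = None
    for length in window_lengths:
        if length <= 0 or length > n:
            continue  # skip windows that cannot fit in the track
        for start in range(0, n - length + 1, stride):
            end = start + length
            score = (prefix[end] - prefix[start]) / length
            if best is None or score > best[2]:
                best = (start, end, score)

    if best is None:
        raise ValueError("no window length fits the track")
    return best


# Hypothetical per-frame scores for a single track.
scores = [0.1, 0.1, 0.9, 0.8, 0.9, 0.1]
start, end, score = best_temporal_window(scores, window_lengths=[2, 3, 4])
print(f"action spans frames [{start}, {end}) with mean score {score:.3f}")
```

Scoring by the mean favors short windows around a single peak. A real system would normally use a learned window-level score, and would keep several non-overlapping windows per track when an action can occur more than once.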
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Action Detection | JHMDB (test) | F@0.5 | 45.8 | 11 |
| Spatio-temporal action detection | UCF101D | video-mAP (IoU=0.2) | 46.8 | 11 |
| Spatio-temporal action detection | J-HMDB (3 splits) | video-mAP (IoU=0.2) | 63.1 | 10 |
| Action Detection | UCF-101-24 (split 1) | -- | -- | 10 |
| Action Detection | JHMDB (average over three splits) | Frame mAP | 0.458 | 6 |
| Spatio-temporal action detection | UCF101 (split1) | mAP (IoU=0.05) | 62.8 | 5 |
| Spatial action detection | J-HMDB | Video mAP (IoU=0.5) | 60.7 | 5 |
| Action Detection | UCF Sports (test) | Diving Score | 60.71 | 4 |
| Action Detection | UCF-101 24 actions | f-mAP | 35.84 | 3 |