Weakly Supervised Action Localization by Sparse Temporal Pooling Network
About
We propose a weakly supervised temporal action localization algorithm for untrimmed videos using convolutional neural networks. Our algorithm learns from video-level class labels and predicts temporal intervals of human actions without requiring temporal localization annotations. We design our network to identify a sparse subset of key segments associated with target actions in a video using an attention module, and to fuse the key segments through adaptive temporal pooling. Our loss function consists of two terms: one minimizes the video-level action classification error and the other enforces sparsity of the segment selection. At inference time, we extract and score temporal proposals using temporal class activations and class-agnostic attentions to estimate the time intervals that correspond to target actions. The proposed algorithm attains state-of-the-art results on the THUMOS14 dataset and outstanding performance on ActivityNet1.3 despite its weak supervision.
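The attention-weighted pooling and the two-term loss described above can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The function names (`sparse_temporal_pooling`, `video_loss`), the single linear attention layer with a sigmoid, the binary cross-entropy classification term, the L1 sparsity penalty, and the weight `beta` are all assumptions chosen for illustration.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def sparse_temporal_pooling(features, w_att, b_att):
    """Pool segment features with class-agnostic attention weights.

    features: (T, D) array, one feature vector per temporal segment.
    w_att, b_att: parameters of a hypothetical linear attention layer.
    Returns the attention weights (T,) and the pooled video feature (D,).
    """
    attention = sigmoid(features @ w_att + b_att)  # one weight in (0, 1) per segment
    pooled = attention @ features                  # attention-weighted sum over time
    return attention, pooled


def video_loss(pooled, labels, w_cls, b_cls, attention, beta=1e-4):
    """Video-level classification loss plus a sparsity penalty on attention.

    labels: (C,) multi-hot video-level class labels.
    beta: assumed trade-off weight between the two terms.
    """
    probs = sigmoid(pooled @ w_cls + b_cls)        # per-class video-level scores
    eps = 1e-7                                     # avoids log(0)
    classification = -np.mean(
        labels * np.log(probs + eps) + (1.0 - labels) * np.log(1.0 - probs + eps)
    )
    sparsity = np.sum(np.abs(attention))           # L1 norm pushes most weights toward 0
    return classification + beta * sparsity
```

Only video-level labels enter the loss, which is what makes the supervision weak: the attention weights are never compared against annotated intervals, and the sparsity term alone encourages the network to keep a small set of segments.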
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Temporal Action Detection | THUMOS-14 (test) | mAP@tIoU=0.5 | 16.9 | 330 |
| Temporal Action Localization | THUMOS14 (test) | AP @ IoU=0.5 | 19.8 | 319 |
| Temporal Action Localization | THUMOS-14 (test) | mAP@0.3 | 35.5 | 308 |
| Temporal Action Localization | ActivityNet 1.3 (val) | AP@0.5 | 29.8 | 257 |
| Temporal Action Detection | ActivityNet v1.3 (val) | mAP@0.5 | 29.3 | 185 |
| Temporal Action Localization | THUMOS 2014 | mAP@0.30 | 35.5 | 93 |
| Temporal Action Detection | ActivityNet 1.3 (test) | Average mAP | 20.07 | 80 |
| Temporal Action Localization | THUMOS 14 | mAP@0.3 | 35.5 | 44 |
| Temporal Action Localization | THUMOS 2014 (test) | mAP (theta=0.5) | 16.9 | 35 |
| Temporal Action Localization | ActivityNet 1.3 | -- | -- | 32 |
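
The thresholds in the metrics above (for example tIoU=0.5) refer to temporal intersection over union between a predicted interval and a ground-truth interval. A minimal sketch of that overlap measure follows; the function name `temporal_iou` and the `(start, end)` tuple layout are assumptions for illustration, not part of any benchmark toolkit.

```python
def temporal_iou(pred, gt):
    """Temporal IoU of two intervals given as (start, end) in the same time unit."""
    # Overlap length, clamped at zero for disjoint intervals.
    intersection = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - intersection
    return intersection / union if union > 0 else 0.0
```

Under the usual convention, a prediction counts as correct at threshold 0.5 when its tIoU with a ground-truth interval of the same class is at least 0.5, so `temporal_iou((0, 10), (5, 15))`, which is one third, would not qualify.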