Background Suppression Network for Weakly-supervised Temporal Action Localization
About
Weakly-supervised temporal action localization is a very challenging problem because frame-wise labels are not given during training; the only supervision is video-level labels indicating whether each video contains frames of an action of interest. Previous methods aggregate frame-level class scores into a video-level prediction and learn from the video-level action labels. This formulation does not fully model the problem, in that background frames are forced to be misclassified as action classes in order to predict the video-level labels accurately. In this paper, we design Background Suppression Network (BaS-Net), which introduces an auxiliary class for background and has a two-branch weight-sharing architecture with an asymmetrical training strategy. This enables BaS-Net to suppress activations from background frames and thereby improve localization performance. Extensive experiments demonstrate the effectiveness of BaS-Net and its superiority over the state-of-the-art methods on the most popular benchmarks: THUMOS'14 and ActivityNet. Our code and the trained model are available at https://github.com/Pilhyeon/BaSNet-pytorch.
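The two-branch idea above can be sketched in a few lines. The following is a minimal, pure-Python illustration (hypothetical names and toy numbers, not the authors' implementation): both branches score each segment with the *same* weight matrix over the action classes plus the auxiliary background class, but the suppression branch first damps segments with a foreground attention. Top-k mean pooling turns segment scores into a video-level prediction; under the asymmetrical training strategy, the base branch is supervised with background present and the suppression branch with background absent, which pushes the attention toward zero on background frames.

```python
# Toy sketch of BaS-Net's two-branch, weight-sharing scoring (illustrative only).

def segment_scores(features, weights):
    # weights: one row per class; the last row plays the role of "background"
    return [[sum(f * w for f, w in zip(feat, row)) for row in weights]
            for feat in features]

def topk_mean_pool(scores, k):
    # video-level score per class = mean of its k highest segment scores
    pooled = []
    for c in range(len(scores[0])):
        top = sorted((s[c] for s in scores), reverse=True)[:k]
        pooled.append(sum(top) / len(top))
    return pooled

# toy video: 2 action segments followed by 2 background segments (2-dim features)
features = [[1.0, 0.1], [0.9, 0.2], [0.3, 0.4], [0.2, 0.5]]
weights = [[1.0, 0.0],   # action class
           [0.3, 0.7]]   # auxiliary background class
attention = [1.0, 0.9, 0.1, 0.2]  # foreground attention (low on background)

# base branch: raw features; suppression branch: attention-weighted features
base_pooled = topk_mean_pool(segment_scores(features, weights), k=2)
supp_feats = [[a * f for f in feat] for a, feat in zip(attention, features)]
supp_pooled = topk_mean_pool(segment_scores(supp_feats, weights), k=2)
# the suppression branch's background score drops while the action score survives
```

Because the classifier weights are shared, only the attention differs between branches; the asymmetric video-level labels are what force that attention to suppress background segments.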
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Temporal Action Localization | THUMOS14 (test) | AP@IoU=0.5 | 27.0 | 319 |
| Temporal Action Localization | THUMOS-14 (test) | mAP@0.3 | 44.6 | 308 |
| Temporal Action Localization | ActivityNet 1.3 (val) | AP@0.5 | 34.5 | 257 |
| Temporal Action Localization | ActivityNet 1.2 (val) | mAP@IoU=0.5 | 38.5 | 110 |
| Temporal Action Localization | THUMOS 2014 | mAP@0.3 | 44.6 | 93 |
| Temporal Action Localization | ActivityNet v1.3 (test) | mAP@IoU=0.5 | 34.5 | 47 |
| Temporal Action Localization | THUMOS 14 | mAP@0.3 | 44.6 | 44 |
| Temporal Action Localization | ActivityNet 1.2 (test) | mAP@0.5 | 38.5 | 36 |
| Temporal Action Localization | ActivityNet 1.2 | mAP@0.5 | 38.5 | 32 |
| Temporal Action Localization | ActivityNet 1.3 | Average mAP | 22.2 | 32 |