W-TALC: Weakly-supervised Temporal Activity Localization and Classification
About
Most activity localization methods in the literature are burdened by the need for frame-wise annotation. Learning from weak labels is a potential way to reduce this manual labeling effort. Recent years have witnessed a substantial influx of tagged videos on the Internet, which can serve as a rich source of weakly-supervised training data. Specifically, the correlations between videos with similar tags can be used to temporally localize the activities. Towards this goal, we present W-TALC, a Weakly-supervised Temporal Activity Localization and Classification framework that uses only video-level labels. The proposed network can be divided into two sub-networks, namely a Two-Stream based feature extractor network and a weakly-supervised module, which we learn by optimizing two complementary loss functions. Qualitative and quantitative results on two challenging datasets, THUMOS14 and ActivityNet1.2, demonstrate that the proposed method is able to detect activities at a fine granularity and achieves better performance than current state-of-the-art methods.
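To make "learning from video-level labels only" concrete, the sketch below shows one common multiple-instance-style recipe: per-segment class scores are pooled over time into a single video-level score per class, and that pooled score is trained against the video tag. This is an illustrative assumption, not necessarily the exact formulation used in W-TALC; the function names, the top-k pooling rule, and the value of `k` are all hypothetical.

```python
import math

def video_level_scores(segment_scores, k):
    """Pool per-segment scores into one score per class.

    segment_scores: list of T rows, each a list of C class scores.
    For every class, average its k highest scores over time
    (an assumed pooling rule, chosen for illustration).
    """
    num_classes = len(segment_scores[0])
    pooled = []
    for c in range(num_classes):
        column = sorted((row[c] for row in segment_scores), reverse=True)
        top = column[:k]
        pooled.append(sum(top) / len(top))
    return pooled

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def video_label_loss(segment_scores, labels, k):
    """Cross-entropy between pooled class probabilities and the
    normalized multi-hot video-level label vector."""
    probs = softmax(video_level_scores(segment_scores, k))
    total = sum(labels)
    return -sum((y / total) * math.log(p)
                for y, p in zip(labels, probs) if y)

# Three segments, two classes; the video is tagged with class 0 only.
scores = [[1.0, 0.0], [3.0, 0.0], [5.0, 0.0]]
loss = video_label_loss(scores, labels=[1, 0], k=2)
```

Because no frame-level label enters the loss, the per-segment scores are free to peak wherever the tagged activity actually occurs, which is what makes them usable for localization afterwards.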
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Temporal Action Detection | THUMOS-14 (test) | mAP@tIoU=0.5 | 22.8 | 330 |
| Temporal Action Localization | THUMOS14 (test) | AP @ IoU=0.5 | 22.8 | 319 |
| Temporal Action Localization | THUMOS-14 (test) | mAP@0.3 | 40.1 | 308 |
| Temporal Action Localization | ActivityNet 1.2 (val) | mAP@IoU 0.5 | 37 | 110 |
| Temporal Action Localization | THUMOS 2014 | mAP@0.30 | 40.1 | 93 |
| Temporal Action Localization | THUMOS 14 | mAP@0.3 | 40.1 | 44 |
| Temporal Action Localization | ActivityNet 1.2 | mAP@0.5 | 37 | 32 |
| Temporal Action Localization | THUMOS14 v1.0 (test) | mAP @ IoU 0.3 | 40.1 | 29 |
| Temporal Action Detection | FineAction | Avg mAP | 3.45 | 27 |
| Action Classification | ActivityNet Untrimmed 1.2 (test) | mAP | 93.2 | 12 |
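
The localization metrics in the table are thresholded on temporal intersection-over-union (tIoU): a predicted segment is matched to a ground-truth segment only if their overlap ratio reaches the stated threshold (0.3 or 0.5). The sketch below shows that overlap test for a single pair of segments. The function names are hypothetical, and a full mAP evaluation additionally needs class matching, one-to-one assignment of predictions to ground truth, and ranking by confidence, none of which is shown here.

```python
def temporal_iou(pred, gt):
    """tIoU of two (start, end) intervals, e.g. in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

def meets_threshold(pred, gt, threshold=0.5):
    """True if the prediction overlaps the ground truth enough
    to be matched at the given tIoU threshold."""
    return temporal_iou(pred, gt) >= threshold

# A 10 s prediction shifted by 5 s against a 10 s ground-truth segment:
# intersection 5 s, union 15 s, so tIoU is one third.
overlap = temporal_iou((0.0, 10.0), (5.0, 15.0))
```

With this pair, the prediction is accepted at threshold 0.3 but rejected at 0.5, which illustrates why the same method reports a higher mAP at the looser threshold.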