Activity Graph Transformer for Temporal Action Localization
About
We introduce Activity Graph Transformer, an end-to-end learnable model for temporal action localization that receives a video as input and directly predicts the set of action instances that appear in it. Detecting and localizing action instances in untrimmed videos requires reasoning over multiple action instances in a video. The dominant paradigms in the literature process videos temporally to either propose action regions or directly produce frame-level detections. However, sequential processing of videos is problematic when the action instances have non-sequential dependencies and/or non-linear temporal ordering, such as overlapping action instances or re-occurrence of action instances over the course of the video. In this work, we capture this non-linear temporal structure by reasoning over videos as non-sequential entities in the form of graphs. We evaluate our model on three challenging datasets: THUMOS14, Charades, and EPIC-Kitchens-100. Our results show that the proposed model outperforms the state of the art by a considerable margin.
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Temporal Action Localization | THUMOS14 (test) | AP @ IoU=0.5 | 50.2 | 319 |
| Temporal Action Detection | THUMOS 14 | mAP@0.3 | 65 | 71 |
| Activity Detection | Charades localize v1 | mAP | 28.6 | 52 |
| Temporal Action Localization (Verb) | Epic-Kitchens-100 (val) | mAP@0.1 | 12.01 | 19 |
| Temporal Action Localization (Noun) | Epic-Kitchens-100 (val) | mAP@0.1 | 11.63 | 17 |
| Multi-label Temporal Action Segmentation | Charades 1.0 (test) | Seg-mAP | 28.6 | 14 |
| Temporal Forgery Localization | LAV-DF 1.0 (full set) | AP@0.5 | 17.85 | 7 |
| Temporal Forgery Localization | LAV-DF 1.0 | AP@0.5 | 15.69 | 7 |
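
Several of the metrics above are reported at a temporal IoU threshold (for example 0.5, 0.3, or 0.1). As a rough illustration of what such a threshold means, the sketch below computes the temporal IoU between a predicted and a ground-truth segment and counts a prediction as correct when the overlap reaches the threshold. It is not the evaluation code of the paper or of any benchmark: the function names `temporal_iou` and `count_true_positives`, the `(start, end)` segment format in seconds, and the greedy one-to-one matching are all assumptions made for the example.

```python
def temporal_iou(pred, gt):
    """Intersection over union of two (start, end) segments."""
    (ps, pe), (gs, ge) = pred, gt
    inter = max(0.0, min(pe, ge) - max(ps, gs))
    union = (pe - ps) + (ge - gs) - inter
    return inter / union if union > 0 else 0.0


def count_true_positives(preds, gts, threshold=0.5):
    """Count predictions matched to a ground-truth segment.

    preds: list of ((start, end), score) pairs (hypothetical format).
    gts:   list of (start, end) ground-truth segments.
    Predictions are visited in descending score order, and each
    ground-truth segment can be matched at most once.
    """
    matched = set()
    tp = 0
    for seg, _score in sorted(preds, key=lambda p: p[1], reverse=True):
        best_iou, best_idx = 0.0, None
        for i, gt in enumerate(gts):
            if i in matched:
                continue
            iou = temporal_iou(seg, gt)
            if iou > best_iou:
                best_iou, best_idx = iou, i
        if best_idx is not None and best_iou >= threshold:
            matched.add(best_idx)
            tp += 1
    return tp


if __name__ == "__main__":
    preds = [((0.0, 10.0), 0.9), ((20.0, 30.0), 0.8)]
    gts = [(1.0, 10.0), (50.0, 60.0)]
    print(temporal_iou((0.0, 10.0), (5.0, 15.0)))
    print(count_true_positives(preds, gts, threshold=0.5))
```

A lower threshold such as 0.1 accepts loosely aligned predictions, while 0.5 requires the predicted segment to overlap the ground truth substantially, which is one reason scores at different thresholds are not directly comparable.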