
Activity Graph Transformer for Temporal Action Localization

About

We introduce Activity Graph Transformer, an end-to-end learnable model for temporal action localization that receives a video as input and directly predicts the set of action instances that appear in it. Detecting and localizing action instances in untrimmed videos requires reasoning over multiple action instances in a video. The dominant paradigms in the literature process videos temporally to either propose action regions or directly produce frame-level detections. However, sequential processing of videos is problematic when the action instances have non-sequential dependencies and/or non-linear temporal ordering, such as overlapping action instances or re-occurrence of action instances over the course of the video. In this work, we capture this non-linear temporal structure by reasoning over videos as non-sequential entities in the form of graphs. We evaluate our model on three challenging datasets: THUMOS14, Charades, and EPIC-Kitchens-100. Our results show that the proposed model outperforms the state of the art by a considerable margin.

Megha Nawhal, Greg Mori • 2021

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Temporal Action Localization | THUMOS14 (test) | AP @ IoU=0.5 | 50.2 | 319 |
| Temporal Action Detection | THUMOS 14 | mAP@0.3 | 65 | 71 |
| Activity Detection | Charades localize v1 | mAP | 28.6 | 52 |
| Temporal Action Localization (Verb) | Epic-Kitchens-100 (val) | mAP@0.1 | 12.01 | 19 |
| Temporal Action Localization (Noun) | Epic-Kitchens-100 (val) | mAP@0.1 | 11.63 | 17 |
| Multi-label Temporal Action Segmentation | Charades 1.0 (test) | Seg-mAP | 28.6 | 14 |
| Temporal Forgery Localization | LAV-DF 1.0 (full set) | AP@0.5 | 17.85 | 7 |
| Temporal Forgery Localization | LAV-DF 1.0 | AP@0.5 | 15.69 | 7 |
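Several of the metrics above (for example AP @ IoU=0.5 and mAP@0.3) depend on a temporal intersection-over-union threshold: a predicted segment counts as a match for a ground-truth segment only if their overlap ratio reaches that threshold. The following is a minimal sketch of that overlap test, not the evaluation code used for these benchmarks. The function names `temporal_iou` and `is_match` are hypothetical, and segments are assumed to be `(start, end)` pairs in seconds.

```python
def temporal_iou(pred, gt):
    """Intersection-over-union of two (start, end) time intervals."""
    # Overlap is zero when the intervals do not intersect.
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def is_match(pred, gt, threshold=0.5):
    """Hypothetical helper: True if the prediction overlaps enough to count as a hit."""
    return temporal_iou(pred, gt) >= threshold


# A prediction covering 2s-10s against a ground-truth action at 4s-10s.
print(temporal_iou((2.0, 10.0), (4.0, 10.0)))
print(is_match((2.0, 10.0), (4.0, 10.0), threshold=0.5))
```

A full AP computation would additionally rank predictions by confidence and allow each ground-truth instance to be matched at most once; this sketch covers only the overlap criterion.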
