ActionFormer: Localizing Moments of Actions with Transformers

About

Self-attention-based Transformer models have demonstrated impressive results for image classification and object detection, and more recently for video understanding. Inspired by this success, we investigate the application of Transformer networks to temporal action localization in videos. To this end, we present ActionFormer, a simple yet powerful model that identifies actions in time and recognizes their categories in a single shot, without using action proposals or relying on pre-defined anchor windows. ActionFormer combines a multiscale feature representation with local self-attention, and uses a lightweight decoder to classify every moment in time and estimate the corresponding action boundaries. We show that this orchestrated design results in major improvements over prior work. Without bells and whistles, ActionFormer achieves 71.0% mAP at tIoU=0.5 on THUMOS14, outperforming the best prior model by 14.1 absolute percentage points. Further, ActionFormer demonstrates strong results on ActivityNet 1.3 (36.6% average mAP) and EPIC-Kitchens 100 (+13.5% average mAP over prior works). Our code is available at http://github.com/happyharrycn/actionformer_release.
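The idea of classifying every moment and estimating its action boundaries can be illustrated with a small decoding sketch. This is not the released implementation: it assumes a common anchor-free formulation in which each time step t outputs a confidence score and two non-negative distances to the action's start and end. The function name decode_moments and the parameters stride and score_threshold are hypothetical, chosen only for this illustration.

```python
from typing import List, Tuple


def decode_moments(
    scores: List[float],
    offsets: List[Tuple[float, float]],
    stride: float = 1.0,
    score_threshold: float = 0.5,
) -> List[Tuple[float, float, float]]:
    """Turn per-moment predictions into (start, end, score) segments.

    Assumption (not taken from the paper text above): moment t sits at
    time t * stride, and offsets[t] = (d_start, d_end) are its predicted
    distances to the action's start and end.
    """
    segments = []
    for t, (score, (d_start, d_end)) in enumerate(zip(scores, offsets)):
        if score < score_threshold:
            continue  # this moment is treated as background
        center = t * stride
        segments.append((center - d_start, center + d_end, score))
    return segments


# Three moments; only the middle one is confident enough to emit a segment.
scores = [0.1, 0.9, 0.2]
offsets = [(0.0, 0.0), (1.0, 2.5), (0.0, 0.0)]
print(decode_moments(scores, offsets))  # [(0.0, 3.5, 0.9)]
```

In a full pipeline, overlapping segments emitted by neighboring moments would typically be merged or suppressed afterwards; that step is omitted here.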

Chenlin Zhang, Jianxin Wu, Yin Li • 2022

Related benchmarks

Task	Dataset	Metric	Result	
Temporal Action Detection	THUMOS-14 (test)	mAP@tIoU=0.5	71	339
Temporal Action Localization	THUMOS14 (test)	AP @ IoU=0.5	73	319
Temporal Action Localization	THUMOS-14 (test)	mAP@0.3	82.1	308
Temporal Action Localization	ActivityNet 1.3 (val)	AP@0.5	55.1	257
Temporal Action Detection	ActivityNet v1.3 (val)	mAP@0.5	54.7	185
Temporal Action Detection	ActivityNet 1.3	mAP@0.5	61.5	143
Temporal Action Localization	THUMOS 2014	mAP@0.30	82.3	93
Temporal Action Detection	ActivityNet 1.3 (test)	Average mAP	36.6	80
Temporal Action Detection	THUMOS 14	mAP@0.3	82.1	71
Temporal Action Localization	ActivityNet 1.3	Average mAP	36.6	60

Showing 10 of 73 rows
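The thresholds in the metric names (0.3, 0.5) refer to temporal IoU between a predicted segment and a ground-truth segment. The following is a minimal sketch of that overlap test only; the helper names temporal_iou and is_match are hypothetical, and a full mAP computation additionally ranks predictions by confidence and matches each ground-truth segment at most once.

```python
from typing import Tuple

Segment = Tuple[float, float]  # (start, end) in seconds


def temporal_iou(pred: Segment, gt: Segment) -> float:
    """Intersection over union of two 1D time intervals."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def is_match(pred: Segment, gt: Segment, threshold: float) -> bool:
    """A prediction overlaps enough with a ground truth if tIoU >= threshold."""
    return temporal_iou(pred, gt) >= threshold


pred, gt = (2.0, 6.0), (4.0, 8.0)
# Intersection is 2 s and union is 6 s, so tIoU = 1/3.
print(is_match(pred, gt, 0.3))  # True
print(is_match(pred, gt, 0.5))  # False
```

The same prediction passes at 0.3 but fails at 0.5, which is why scores reported at lower thresholds are generally higher for the same model.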
