A Closer Look at Spatiotemporal Convolutions for Action Recognition

About

In this paper we discuss several forms of spatiotemporal convolutions for video analysis and study their effects on action recognition. Our motivation stems from the observation that 2D CNNs applied to individual frames of the video have remained solid performers in action recognition. In this work we empirically demonstrate the accuracy advantages of 3D CNNs over 2D CNNs within the framework of residual learning. Furthermore, we show that factorizing the 3D convolutional filters into separate spatial and temporal components yields significant advantages in accuracy. Our empirical study leads to the design of a new spatiotemporal convolutional block, "R(2+1)D", which gives rise to CNNs that achieve results comparable or superior to the state of the art on Sports-1M, Kinetics, UCF101, and HMDB51.

Du Tran, Heng Wang, Lorenzo Torresani, Jamie Ray, Yann LeCun, Manohar Paluri • 2017
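
The factorization mentioned in the abstract can be made concrete with a small weight-counting sketch. It is not taken from this page: the function names are hypothetical, biases are ignored, and the intermediate channel width uses the parameter-matching rule as recalled from the paper, so it should be checked against the paper before being relied on. The sketch compares a full t x d x d 3D convolution with a 1 x d x d spatial convolution followed by a t x 1 x 1 temporal convolution.

```python
def conv3d_params(n_in, n_out, t=3, d=3):
    """Weights in a full t x d x d 3D convolution (bias ignored)."""
    return n_in * n_out * t * d * d


def mid_channels(n_in, n_out, t=3, d=3):
    """Intermediate width chosen so the factorized block has roughly the
    same number of weights as the full 3D convolution (assumed rule)."""
    return (t * d * d * n_in * n_out) // (d * d * n_in + t * n_out)


def conv2plus1d_params(n_in, n_out, t=3, d=3):
    """Weights in a 1 x d x d spatial conv followed by a t x 1 x 1 temporal conv."""
    m = mid_channels(n_in, n_out, t, d)
    spatial = n_in * m * d * d   # 2D filters applied to each frame
    temporal = m * n_out * t     # 1D filters applied across frames
    return spatial + temporal


if __name__ == "__main__":
    print(conv3d_params(64, 64))       # 110592
    print(mid_channels(64, 64))        # 144
    print(conv2plus1d_params(64, 64))  # 110592
```

With 64 input and 64 output channels and 3 x 3 x 3 filters, the two designs have the same number of weights under this rule. The difference is that the factorized block applies two convolutions in sequence, which leaves room for an extra nonlinearity between the spatial and temporal steps.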

Related benchmarks

Task                  Dataset                            Metric           Result  Rank
Action Recognition    Kinetics-400                       Top-1 Acc        75.4    413
Action Recognition    UCF101                             Accuracy         98.17   365
Action Recognition    UCF101 (mean of 3 splits)          Accuracy         97.3    357
Action Recognition    UCF101 (test)                      Accuracy         96.8    307
Action Recognition    HMDB51 (test)                      Accuracy         0.745   249
Action Recognition    Kinetics 400 (test)                Top-1 Accuracy   75.4    245
Action Recognition    HMDB51                             Top-1 Acc        80.54   225
Action Recognition    HMDB-51 (average of three splits)  Top-1 Acc        78.7    204
Video Classification  Kinetics 400 (val)                 Top-1 Acc        74.3    204
Action Recognition    HMDB51                             3-Fold Accuracy  78.7    191

Showing 10 of 88 rows
