A Closer Look at Spatiotemporal Convolutions for Action Recognition

About

In this paper we discuss several forms of spatiotemporal convolutions for video analysis and study their effects on action recognition. Our motivation stems from the observation that 2D CNNs applied to individual frames of the video have remained solid performers in action recognition. In this work we empirically demonstrate the accuracy advantages of 3D CNNs over 2D CNNs within the framework of residual learning. Furthermore, we show that factorizing the 3D convolutional filters into separate spatial and temporal components yields significant gains in accuracy. Our empirical study leads to the design of a new spatiotemporal convolutional block, "R(2+1)D", which gives rise to CNNs that achieve results comparable or superior to the state of the art on Sports-1M, Kinetics, UCF101 and HMDB51.

Du Tran, Heng Wang, Lorenzo Torresani, Jamie Ray, Yann LeCun, Manohar Paluri • 2017
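
To make the factorization described in the abstract concrete, the sketch below counts weights for a full t x d x d 3D convolution and for a spatial 1 x d x d convolution followed by a temporal t x 1 x 1 convolution. The function names, the 64-channel example, and the rule for choosing the intermediate width (`matched_mid_channels`, which keeps the factorized weight count at or below the full 3D count) are illustrative assumptions, not details taken from the text above.

```python
# Illustrative sketch: parameter counts for a full 3D convolution versus a
# (2+1)D factorization into a spatial and a temporal convolution.
# All names and the channel sizes below are hypothetical.

def conv3d_params(n_in, n_out, t=3, d=3):
    """Weights in a full t x d x d 3D convolution (bias ignored)."""
    return n_in * n_out * t * d * d


def conv2plus1d_params(n_in, n_out, mid, t=3, d=3):
    """Weights in a 1 x d x d spatial conv (n_in -> mid) followed by
    a t x 1 x 1 temporal conv (mid -> n_out), bias ignored."""
    spatial = n_in * mid * d * d
    temporal = mid * n_out * t
    return spatial + temporal


def matched_mid_channels(n_in, n_out, t=3, d=3):
    """Assumed rule: largest intermediate width whose factorized weight
    count does not exceed that of the full 3D convolution."""
    return (t * d * d * n_in * n_out) // (d * d * n_in + t * n_out)


n_in, n_out = 64, 64
mid = matched_mid_channels(n_in, n_out)
full = conv3d_params(n_in, n_out)
factored = conv2plus1d_params(n_in, n_out, mid)
print(mid, full, factored)
```

With the width chosen this way the two variants have about the same number of weights, so any accuracy difference between them cannot be attributed to model size. In a real network a nonlinearity would typically sit between the spatial and temporal convolutions; this sketch only counts weights.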

Related benchmarks

Task	Dataset	Result
Action Recognition	Kinetics-400	Top-1 Acc 75.4	505
Action Recognition	UCF101	Accuracy 98.17	433
Action Recognition	UCF101 (test)	Accuracy 96.8	376
Action Recognition	UCF101 (mean of 3 splits)	Accuracy 97.3	357
Action Recognition	HMDB51 (test)	Accuracy 0.745	249
Action Recognition	Kinetics 400 (test)	Top-1 Accuracy 75.4	245
Action Recognition	HMDB51	Top-1 Acc 80.54	225
Action Recognition	UCF-101	Top-1 Acc 96.8	225
Action Recognition	HMDB-51 (average of three splits)	Top-1 Acc 78.7	204
Video Classification	Kinetics 400 (val)	Top-1 Acc 74.3	204

Showing 10 of 119 rows
