
ConvNet Architecture Search for Spatiotemporal Feature Learning

About

Learning image representations with ConvNets by pre-training on ImageNet has proven useful across many visual understanding tasks, including object detection, semantic segmentation, and image captioning. Although any image representation can be applied to video frames, a dedicated spatiotemporal representation is still vital to incorporate motion patterns that cannot be captured by appearance-based models alone. This paper presents an empirical ConvNet architecture search for spatiotemporal feature learning, culminating in a deep 3-dimensional (3D) Residual ConvNet. Our proposed architecture outperforms C3D by a good margin on Sports-1M, UCF101, HMDB51, THUMOS14, and ASLAN while being 2 times faster at inference time, 2 times smaller in model size, and having a more compact representation.
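
To make the notion of a 3D residual unit more concrete, here is a minimal sketch in plain Python that tracks tensor shapes and weight counts for two stacked 3D convolutions with an identity shortcut. It is only shape and parameter arithmetic, not the authors' architecture: the helper names (`conv3d_out_shape`, `conv3d_params`, `residual_unit`), the 64-channel width, and the 8 x 56 x 56 clip size are all illustrative assumptions.

```python
from typing import Tuple

# A video feature map is described as (channels, frames, height, width).
Shape = Tuple[int, int, int, int]
Triple = Tuple[int, int, int]


def conv3d_out_shape(in_shape: Shape, out_channels: int, kernel: Triple,
                     stride: Triple = (1, 1, 1),
                     padding: Triple = (0, 0, 0)) -> Shape:
    """Output shape of a 3D convolution (no dilation) over a (C, T, H, W) input."""
    _, *dims = in_shape
    t, h, w = (
        (d + 2 * p - k) // s + 1
        for d, k, s, p in zip(dims, kernel, stride, padding)
    )
    return (out_channels, t, h, w)


def conv3d_params(in_channels: int, out_channels: int, kernel: Triple) -> int:
    """Number of weights in a bias-free 3D convolution."""
    kt, kh, kw = kernel
    return in_channels * out_channels * kt * kh * kw


def residual_unit(in_shape: Shape, kernel: Triple = (3, 3, 3)) -> Tuple[Shape, int]:
    """Two shape-preserving 3D convs whose output is added to the input.

    Returns the output shape and the total weight count. The identity
    shortcut only works if the output shape equals the input shape,
    which is checked explicitly.
    """
    channels = in_shape[0]
    pad = tuple(k // 2 for k in kernel)  # shape-preserving padding for odd kernels
    mid = conv3d_out_shape(in_shape, channels, kernel, padding=pad)
    out = conv3d_out_shape(mid, channels, kernel, padding=pad)
    if out != in_shape:
        raise ValueError(f"shortcut shape mismatch: {in_shape} vs {out}")
    params = 2 * conv3d_params(channels, channels, kernel)
    return out, params


# Hypothetical feature map: 64 channels, 8 frames, 56 x 56 pixels.
clip: Shape = (64, 8, 56, 56)
out_shape, n_params = residual_unit(clip)
```

The temporal kernel extent multiplies the weight count directly: a 3 x 3 x 3 kernel carries three times the weights of a 3 x 3 spatial kernel at the same channel width. This is one reason model size and inference speed are natural axes to compare in a 3D architecture search.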

Du Tran, Jamie Ray, Zheng Shou, Shih-Fu Chang, Manohar Paluri • 2017

Related benchmarks

Task                      Dataset                            Result                Rank
Action Recognition        UCF101                             Accuracy 85.8         365
Action Recognition        UCF101 (mean of 3 splits)          Accuracy 88.6         357
Action Recognition        UCF101 (test)                      Accuracy 85.8         307
Action Recognition        HMDB51 (test)                      Accuracy 0.549        249
Action Recognition        HMDB-51 (average of three splits)  Top-1 Acc 58.8        204
Action Recognition        HMDB51                             3-Fold Accuracy 54.9  191
Action Recognition        Kinetics-400 full (val)            Top-1 Acc 73.9        136
Action Classification     HMDB51 (over all three splits)     Accuracy 54.9         121
Video Action Recognition  HMDB-51 (3 splits)                 Accuracy 54.9         116
Action Recognition        UCF101 (Split 1)                   --                    105
Showing 10 of 13 rows
