ConvNet Architecture Search for Spatiotemporal Feature Learning

About

Learning image representations with ConvNets by pre-training on ImageNet has proven useful across many visual understanding tasks, including object detection, semantic segmentation, and image captioning. Although any image representation can be applied to video frames, a dedicated spatiotemporal representation is still vital in order to incorporate motion patterns that cannot be captured by appearance-based models alone. This paper presents an empirical ConvNet architecture search for spatiotemporal feature learning, culminating in a deep 3-dimensional (3D) Residual ConvNet. Our proposed architecture outperforms C3D by a good margin on Sports-1M, UCF101, HMDB51, THUMOS14, and ASLAN while being 2 times faster at inference time, 2 times smaller in model size, and having a more compact representation.

Du Tran, Jamie Ray, Zheng Shou, Shih-Fu Chang, Manohar Paluri • 2017
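
The abstract's point that motion cannot be captured by appearance alone can be illustrated with a toy sketch. This is not the paper's architecture: the clips, the kernel, and the function names `appearance_feature` and `motion_response` are all hypothetical, and frames are reduced to a single row of pixels so that a time-by-space kernel stands in for a full 3D one. Under those assumptions, a feature pooled per pixel over time cannot tell a clip from its reversed copy, while a kernel that spans two frames can.

```python
# Toy illustration (assumed setup, not the paper's model).
# Each clip is a list of frames; each frame is a 1-D row of pixels.
forward = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]   # bright pixel moves right
backward = list(reversed(forward))            # same frames, reversed order


def appearance_feature(clip):
    """Average each pixel over time; frame order is discarded."""
    n = len(clip)
    width = len(clip[0])
    return [sum(frame[x] for frame in clip) / n for x in range(width)]


def motion_response(clip, kernel):
    """Slide a (time x space) kernel over the clip, apply ReLU, then max-pool."""
    kt, kx = len(kernel), len(kernel[0])
    best = 0  # starting at 0 acts as the ReLU floor
    for t in range(len(clip) - kt + 1):
        for x in range(len(clip[0]) - kx + 1):
            r = sum(
                kernel[dt][dx] * clip[t + dt][x + dx]
                for dt in range(kt)
                for dx in range(kx)
            )
            best = max(best, r)
    return best


# Hypothetical kernel tuned to "pixel at x now, pixel at x + 1 in the next frame".
rightward = [[1, 0],
             [0, 1]]

same_appearance = appearance_feature(forward) == appearance_feature(backward)
forward_score = motion_response(forward, rightward)
backward_score = motion_response(backward, rightward)
```

Here `same_appearance` is true, while `forward_score` is larger than `backward_score`: only the kernel that looks across frames separates the two clips. A 3D ConvNet learns many such kernels over height, width, and time rather than using a hand-written one.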

Related benchmarks

Task	Dataset	Result
Action Recognition	UCF101	Accuracy 85.8	433
Action Recognition	UCF101 (mean of 3 splits)	Accuracy 88.6	357
Action Recognition	UCF101 (test)	Accuracy 85.8	357
Action Recognition	HMDB51 (test)	Accuracy 0.549	249
Action Recognition	HMDB-51 (average of three splits)	Top-1 Acc 58.8	204
Action Recognition	HMDB51	3-Fold Accuracy 54.9	191
Action Recognition	Kinetics-400 full (val)	Top-1 Acc 73.9	141
Action Classification	HMDB51 (over all three splits)	Accuracy 54.9	121
Video Action Recognition	HMDB-51 (3 splits)	Accuracy 54.9	116
Action Recognition	UCF101 (Split 1)	--	105

Showing 10 of 13 rows
