Two-Stream Convolutional Networks for Action Recognition in Videos

About

We investigate architectures of discriminatively trained deep Convolutional Networks (ConvNets) for action recognition in video. The challenge is to capture the complementary information on appearance from still frames and motion between frames. We also aim to generalise the best performing hand-crafted features within a data-driven learning framework. Our contribution is three-fold. First, we propose a two-stream ConvNet architecture which incorporates spatial and temporal networks. Second, we demonstrate that a ConvNet trained on multi-frame dense optical flow is able to achieve very good performance in spite of limited training data. Finally, we show that multi-task learning, applied to two different action classification datasets, can be used to increase the amount of training data and improve the performance on both. Our architecture is trained and evaluated on the standard video actions benchmarks of UCF-101 and HMDB-51, where it is competitive with the state of the art. It also exceeds by a large margin previous attempts to use deep nets for video classification.

Karen Simonyan, Andrew Zisserman• 2014

Related benchmarks

Task	Dataset	Result
Action Recognition	NTU RGB+D (Cross-View)	Accuracy83.3	652
Action Recognition	NTU RGB+D (Cross-subject)	Accuracy74.4	500
Action Recognition	UCF101	Accuracy88	433
Action Recognition	UCF101 (mean of 3 splits)	Accuracy91.7	357
Action Recognition	UCF101 (test)	Accuracy92.5	357
Action Recognition	HMDB51 (test)	Accuracy0.624	249
Action Recognition	Kinetics 400 (test)	Top-1 Accuracy65.6	245
Action Recognition	NTU 120 (Cross-Setup)	Accuracy54.8	231
Action Recognition	HMDB51	Top-1 Acc59.4	225
Action Recognition	HMDB-51 (average of three splits)	Top-1 Acc59.4	204

Showing 10 of 82 rows

...

Other info

Follow for update

@wizwand_team Discord