Convolutional Two-Stream Network Fusion for Video Action Recognition

About

Recent applications of Convolutional Neural Networks (ConvNets) for human action recognition in videos have proposed different solutions for incorporating the appearance and motion information. We study a number of ways of fusing ConvNet towers both spatially and temporally in order to best take advantage of this spatio-temporal information. We make the following findings: (i) that rather than fusing at the softmax layer, a spatial and temporal network can be fused at a convolution layer without loss of performance, but with a substantial saving in parameters; (ii) that it is better to fuse such networks spatially at the last convolutional layer than earlier, and that additionally fusing at the class prediction layer can boost accuracy; finally (iii) that pooling of abstract convolutional features over spatiotemporal neighbourhoods further boosts performance. Based on these studies we propose a new ConvNet architecture for spatiotemporal fusion of video snippets, and evaluate its performance on standard benchmarks where this architecture achieves state-of-the-art results.

Christoph Feichtenhofer, Axel Pinz, Andrew Zisserman • 2016
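
To make findings (i) and (iii) concrete, here is a minimal sketch of fusing two feature maps at a convolutional layer and then pooling over a spatiotemporal neighbourhood. It is an illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the tensors are tiny nested lists, the fusion weights are set by hand rather than learned, and max pooling stands in for whichever pooling operator the paper uses.

```python
# Illustrative sketch only; all names below are hypothetical.
# A feature map is a nested list indexed [channel][row][col].

def concat_channels(spatial, temporal):
    """Stack the two towers' feature maps along the channel axis."""
    return spatial + temporal

def conv1x1(fmap, weights, bias):
    """1x1 convolution: every output channel is a weighted sum of the
    input channels at each pixel, plus a per-channel bias."""
    height, width = len(fmap[0]), len(fmap[0][0])
    out = []
    for w_row, b in zip(weights, bias):
        out.append([[b + sum(w * fmap[c][i][j] for c, w in enumerate(w_row))
                     for j in range(width)]
                    for i in range(height)])
    return out

def conv_fusion(spatial, temporal, weights, bias):
    """Fuse at a conv layer: concatenate channels, then mix them with a
    1x1 filter bank (hand-set here; assumed to be learned in practice)."""
    return conv1x1(concat_channels(spatial, temporal), weights, bias)

def max_pool_3d(frames, t_size, h_size, w_size):
    """Non-overlapping max pooling over time, height and width for a
    single channel; frames is indexed [time][row][col]."""
    T, H, W = len(frames), len(frames[0]), len(frames[0][0])
    return [[[max(frames[t][i][j]
                  for t in range(t0, min(t0 + t_size, T))
                  for i in range(i0, min(i0 + h_size, H))
                  for j in range(j0, min(j0 + w_size, W)))
              for j0 in range(0, W, w_size)]
             for i0 in range(0, H, h_size)]
            for t0 in range(0, T, t_size)]

# One-channel 2x2 maps from the appearance and motion towers (toy values).
spatial = [[[1.0, 2.0], [3.0, 4.0]]]
temporal = [[[10.0, 20.0], [30.0, 40.0]]]

# One output channel that averages the two input channels.
fused = conv_fusion(spatial, temporal, weights=[[0.5, 0.5]], bias=[0.0])
print(fused)   # [[[5.5, 11.0], [16.5, 22.0]]]

# Pool the fused channel together with a second (made-up) time step.
clip = [fused[0], [[1.0, 30.0], [2.0, 3.0]]]
print(max_pool_3d(clip, t_size=2, h_size=2, w_size=2))   # [[[30.0]]]
```

The parameter saving mentioned in (i) comes from sharing everything after the fusion layer: once the two towers are merged into one set of channels, only a single stack of later layers is needed instead of two.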

Related benchmarks

Task	Dataset	Result
Action Recognition	UCF101	Accuracy 93.5	433
Action Recognition	UCF101 (mean of 3 splits)	Accuracy 93.5	357
Action Recognition	HMDB51	Top-1 Acc 69.2	225
Action Recognition	UCF-101	Top-1 Acc 92.5	225
Action Recognition	HMDB-51 (average of three splits)	Top-1 Acc 69.2	204
Action Recognition	HMDB51	3-Fold Accuracy 69.2	191
Action Recognition	UCF101 (3 splits)	Accuracy 93.5	155
Action Classification	HMDB51 (over all three splits)	Accuracy 65.4	121
Video Action Recognition	HMDB-51 (3 splits)	Accuracy 65.4	116
Action Recognition	HMDB51 (split 1)	--	99

Showing 10 of 22 rows
