
Beyond Short Snippets: Deep Networks for Video Classification

About

Convolutional neural networks (CNNs) have been extensively applied to image recognition problems, giving state-of-the-art results on recognition, detection, segmentation and retrieval. In this work we propose and evaluate several deep neural network architectures that combine image information across a video over longer time periods than previously attempted. We propose two methods capable of handling full-length videos. The first explores various convolutional temporal feature pooling architectures, examining the design choices that need to be made when adapting a CNN for this task. The second explicitly models the video as an ordered sequence of frames. For this purpose we employ a recurrent neural network that uses Long Short-Term Memory (LSTM) cells connected to the output of the underlying CNN. Our best networks exhibit significant performance improvements over previously published results on the Sports-1M dataset (73.1% vs. 60.9%) and on the UCF-101 dataset, both with additional optical flow information (88.6% vs. 88.0%) and without it (82.6% vs. 72.8%).
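The two approaches in the abstract can be contrasted with a small sketch. This is an illustration under stated assumptions, not the authors' implementation: random vectors stand in for per-frame CNN outputs, and the names `temporal_max_pool`, `lstm_step`, `init_params` and `lstm_encode` are hypothetical. Max pooling is only one of the pooling choices the abstract alludes to, and the LSTM shown is a single standard layer with untrained weights.

```python
import math
import random


def temporal_max_pool(frame_features):
    """Element-wise max over time: T vectors of size D -> one vector of size D."""
    return [max(column) for column in zip(*frame_features)]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(matrix, vector):
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def lstm_step(x, h, c, params):
    """One standard LSTM step; params maps gate name -> (W_x, W_h, bias)."""
    def pre_activation(gate):
        w_x, w_h, bias = params[gate]
        return [a + r + b for a, r, b in zip(matvec(w_x, x), matvec(w_h, h), bias)]

    i = [sigmoid(v) for v in pre_activation("input")]
    f = [sigmoid(v) for v in pre_activation("forget")]
    o = [sigmoid(v) for v in pre_activation("output")]
    g = [math.tanh(v) for v in pre_activation("cell")]
    c_new = [fj * cj + ij * gj for fj, cj, ij, gj in zip(f, c, i, g)]
    h_new = [oj * math.tanh(cj) for oj, cj in zip(o, c_new)]
    return h_new, c_new


def init_params(input_dim, hidden_dim, rng):
    """Small random weights and zero biases (assumption: no training is shown)."""
    def matrix(rows, cols):
        return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

    return {
        gate: (matrix(hidden_dim, input_dim),
               matrix(hidden_dim, hidden_dim),
               [0.0] * hidden_dim)
        for gate in ("input", "forget", "output", "cell")
    }


def lstm_encode(frame_features, params, hidden_dim):
    """Feed the frames in order and return the final hidden state."""
    h = [0.0] * hidden_dim
    c = [0.0] * hidden_dim
    for x in frame_features:
        h, c = lstm_step(x, h, c, params)
    return h


rng = random.Random(0)
# Stand-in for CNN features: 6 frames, 4 features each (hypothetical sizes).
frames = [[rng.random() for _ in range(4)] for _ in range(6)]
params = init_params(input_dim=4, hidden_dim=3, rng=rng)

pooled = temporal_max_pool(frames)                # one 4-d video descriptor
final_h = lstm_encode(frames, params, hidden_dim=3)  # one 3-d video descriptor
```

Max pooling gives the same descriptor regardless of frame order, whereas the LSTM state depends on the order in which frames arrive, which is what "explicitly models the video as an ordered sequence of frames" refers to. In either case a classifier would sit on top of the resulting descriptor.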

Joe Yue-Hei Ng, Matthew Hausknecht, Sudheendra Vijayanarasimhan, Oriol Vinyals, Rajat Monga, George Toderici • 2015

Related benchmarks

Task                  | Dataset                            | Result               | Rank
Action Recognition    | UCF101                             | Accuracy 88.6        | 365
Action Recognition    | UCF101 (mean of 3 splits)          | Accuracy 88.6        | 357
Action Recognition    | UCF101 (test)                      | Accuracy 88.6        | 307
Action Recognition    | UCF101 (3 splits)                  | Accuracy 88.6        | 155
Video Classification  | UCF101 (3-split average)           | Accuracy 88.6        | 41
Video Classification  | UCF101 (averaged over three splits) | Accuracy 88.6       | 39
Video Classification  | Charades                           | mAP 17.8             | 38
Video Classification  | Sports-1M                          | Hit@1 90.8           | 33
Action Recognition    | UCF-101                            | 3-Fold Accuracy 88.6 | 32
Temporal Localization | Charades                           | mAP 9.6              | 12
Showing 10 of 15 rows
