
ActionVLAD: Learning spatio-temporal aggregation for action classification

About

In this work, we introduce a new video representation for action classification that aggregates local convolutional features across the entire spatio-temporal extent of the video. We do so by integrating state-of-the-art two-stream networks with learnable spatio-temporal feature aggregation. The resulting architecture is end-to-end trainable for whole-video classification. We investigate different strategies for pooling across space and time and for combining signals from the different streams. We find that (i) it is important to pool jointly across space and time, but (ii) appearance and motion streams are best aggregated into their own separate representations. Finally, we show that our representation outperforms the two-stream base architecture by a large margin (13% relative) and also outperforms other baselines with comparable base architectures on the HMDB51, UCF101, and Charades video classification benchmarks.
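The learnable aggregation described above is a NetVLAD-style pooling layer applied over conv features from all frames and spatial positions. The following is a minimal numpy sketch of that soft-assignment pooling; the function name, shapes, and the `alpha` sharpness parameter are illustrative assumptions, not the authors' exact implementation.

```python
import numpy as np

def actionvlad_pool(features, centers, alpha=1.0):
    """NetVLAD-style soft-assignment pooling over all local features.

    features: (N, D) conv features gathered across every frame and
              spatial location of the video (illustrative layout)
    centers:  (K, D) learnable "action word" anchor points
    returns:  (K*D,) L2-normalized whole-video descriptor
    """
    # Soft assignment of each feature to each center: softmax over
    # negative squared distances, scaled by alpha.
    d2 = ((features[:, None, :] - centers[None, :, :]) ** 2).sum(-1)  # (N, K)
    logits = -alpha * d2
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    a = np.exp(logits)
    a /= a.sum(axis=1, keepdims=True)             # (N, K)

    # Aggregate residuals to each center, weighted by assignment.
    resid = features[:, None, :] - centers[None, :, :]   # (N, K, D)
    V = (a[:, :, None] * resid).sum(axis=0)              # (K, D)

    # Intra-normalize per cluster, then flatten and L2-normalize.
    V /= np.linalg.norm(V, axis=1, keepdims=True) + 1e-12
    v = V.reshape(-1)
    return v / (np.linalg.norm(v) + 1e-12)
```

Per finding (ii), the appearance and motion streams would each be pooled this way with their own centers, and the two resulting descriptors combined (e.g. concatenated) for classification.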

Rohit Girdhar, Deva Ramanan, Abhinav Gupta, Josef Sivic, Bryan Russell • 2017

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Action Recognition | UCF101 | Accuracy | 93.6 | 365 |
| Action Recognition | UCF101 (mean of 3 splits) | Accuracy | 93.6 | 357 |
| Action Recognition | HMDB51 | Top-1 Acc | 69.8 | 225 |
| Action Recognition | HMDB-51 (average of three splits) | Top-1 Acc | 69.8 | 204 |
| Action Recognition | HMDB51 | 3-Fold Accuracy | 69.8 | 191 |
| Action Recognition | UCF101 (3 splits) | Accuracy | 93.6 | 155 |
| Action Classification | HMDB51 (over all three splits) | Accuracy | 49.8 | 121 |
| Action Recognition | HMDB51 (split 1) | -- | -- | 75 |
| Action Recognition | Charades | mAP | 0.21 | 64 |
| Action Classification | HMDB51 (split1) | Accuracy | 51.2 | 58 |

(Showing 10 of 19 rows.)
