
Appearance-and-Relation Networks for Video Classification

About

Spatiotemporal feature learning in videos is a fundamental problem in computer vision. This paper presents a new architecture, termed Appearance-and-Relation Network (ARTNet), to learn video representations in an end-to-end manner. ARTNets are constructed by stacking multiple generic building blocks, called SMART, whose goal is to simultaneously model appearance and relation from RGB input in a separate and explicit manner. Specifically, SMART blocks decouple the spatiotemporal learning module into an appearance branch for spatial modeling and a relation branch for temporal modeling. The appearance branch is implemented based on the linear combination of pixels or filter responses in each frame, while the relation branch is designed based on the multiplicative interactions between pixels or filter responses across multiple frames. We perform experiments on three action recognition benchmarks: Kinetics, UCF101, and HMDB51, demonstrating that SMART blocks obtain an evident improvement over 3D convolutions for spatiotemporal feature learning. Under the same training setting, ARTNets achieve performance superior to existing state-of-the-art methods on these three datasets.

Limin Wang, Wei Li, Wen Li, Luc Van Gool • 2017
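The two-branch design described in the abstract can be sketched in code. The following is a minimal PyTorch illustration, not the authors' implementation: the appearance branch uses a spatial-only (1×k×k) convolution, i.e. a linear combination of pixels/filter responses within each frame, while the relation branch squares the responses of a spatiotemporal convolution so that multiplicative interactions across frames are captured, followed by cross-channel pooling. All layer sizes, the `SMARTBlock` class name, and the fusion-by-concatenation step are illustrative assumptions.

```python
# Hedged sketch of a SMART-style block (assumes PyTorch); sizes are illustrative.
import torch
import torch.nn as nn


class SMARTBlock(nn.Module):
    def __init__(self, in_ch: int, out_ch: int, t_kernel: int = 3):
        super().__init__()
        # Appearance branch: spatial-only convolution, a linear combination
        # of pixels/filter responses within each individual frame.
        self.appearance = nn.Sequential(
            nn.Conv3d(in_ch, out_ch, kernel_size=(1, 3, 3), padding=(0, 1, 1)),
            nn.BatchNorm3d(out_ch),
        )
        # Relation branch: spatiotemporal convolution whose responses are
        # squared, modeling multiplicative interactions across frames
        # (energy-model style); a 1x1x1 conv then pools across channels.
        self.relation_conv = nn.Conv3d(
            in_ch, 2 * out_ch,
            kernel_size=(t_kernel, 3, 3),
            padding=(t_kernel // 2, 1, 1),
        )
        self.relation_pool = nn.Conv3d(2 * out_ch, out_ch, kernel_size=1, bias=False)
        self.relation_bn = nn.BatchNorm3d(out_ch)
        # Fuse the two branches by concatenation and 1x1x1 reduction.
        self.reduce = nn.Sequential(
            nn.Conv3d(2 * out_ch, out_ch, kernel_size=1),
            nn.BatchNorm3d(out_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (N, C, T, H, W) clip of RGB frames
        app = self.appearance(x)
        rel = self.relation_bn(self.relation_pool(self.relation_conv(x) ** 2))
        return self.reduce(torch.cat([app, rel], dim=1))


# Example: a batch of two 8-frame RGB clips at 112x112
clip = torch.randn(2, 3, 8, 112, 112)
out = SMARTBlock(3, 64)(clip)
print(out.shape)  # torch.Size([2, 64, 8, 112, 112])
```

An ARTNet would stack such blocks in place of plain 3D convolutions; the key point the sketch shows is the explicit separation of additive (spatial) and multiplicative (temporal) interactions before fusion.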

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Action Recognition | Kinetics-400 | Top-1 Acc | 70.7 | 413 |
| Action Recognition | UCF101 | Accuracy | 94.3 | 365 |
| Action Recognition | UCF101 (mean of 3 splits) | Accuracy | 94.3 | 357 |
| Action Recognition | UCF101 (test) | Accuracy | 94.3 | 307 |
| Action Recognition | HMDB51 (test) | Accuracy | 0.709 | 249 |
| Action Recognition | Kinetics 400 (test) | Top-1 Accuracy | 70.7 | 245 |
| Action Recognition | HMDB51 | Top-1 Acc | 70.9 | 225 |
| Action Recognition | HMDB-51 (average of three splits) | Top-1 Acc | 70.9 | 204 |
| Video Classification | Kinetics 400 (val) | Top-1 Acc | 69.2 | 204 |
| Action Recognition | UCF101 (3 splits) | Accuracy | 94.3 | 155 |
Showing 10 of 27 rows.
