Chained Multi-stream Networks Exploiting Pose, Motion, and Appearance for Action Classification and Detection

About

General human action recognition requires understanding of various visual cues. In this paper, we propose a network architecture that computes and integrates the most important visual cues for action recognition: pose, motion, and the raw images. For the integration, we introduce a Markov chain model which adds cues successively. The resulting approach is efficient and applicable to action classification as well as to spatial and temporal action localization. The two contributions clearly improve the performance over respective baselines. The overall approach achieves state-of-the-art action classification performance on HMDB51, J-HMDB and NTU RGB+D datasets. Moreover, it yields state-of-the-art spatio-temporal action localization results on UCF101 and J-HMDB.
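The successive integration of cues described above can be illustrated with a small sketch. This is only one plausible reading of "a Markov chain model which adds cues successively", not the authors' implementation: the stage function, the stream order (pose, then motion, then raw images), the feature sizes, and the random weights are all assumptions made for illustration. Each stage sees one new stream plus only the previous stage's hidden state and class prediction, which is what makes it a chain.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def linear(weights, x):
    """Matrix-vector product for a list-of-rows weight matrix."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def random_matrix(rows, cols, rng):
    """Hypothetical stand-in for learned weights (small Gaussian values)."""
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


def chain_stage(stream_feat, prev_hidden, prev_probs, w_hidden, w_out):
    """One chain step: combine a new cue with the previous stage's output.

    The stage depends only on the immediately preceding stage
    (the Markov assumption), never on earlier stages directly.
    """
    x = list(stream_feat) + list(prev_hidden) + list(prev_probs)
    hidden = [math.tanh(v) for v in linear(w_hidden, x)]
    probs = softmax(linear(w_out, hidden))
    return hidden, probs


def run_chain(stream_feats, num_classes, hidden_dim, rng):
    """Refine the class prediction by adding one stream per stage."""
    hidden = [0.0] * hidden_dim                     # no prior state before stage 1
    probs = [1.0 / num_classes] * num_classes       # uniform prior prediction
    per_stage = []
    for feat in stream_feats:
        in_dim = len(feat) + hidden_dim + num_classes
        w_hidden = random_matrix(hidden_dim, in_dim, rng)
        w_out = random_matrix(num_classes, hidden_dim, rng)
        hidden, probs = chain_stage(feat, hidden, probs, w_hidden, w_out)
        per_stage.append(probs)
    return per_stage


rng = random.Random(0)
# Assumed order: pose, motion, raw-image features (16-dim placeholders).
feats = [[rng.gauss(0.0, 1.0) for _ in range(16)] for _ in range(3)]
stage_probs = run_chain(feats, num_classes=51, hidden_dim=32, rng=rng)
```

Here `stage_probs` holds one class distribution per stage, so the prediction after each added cue can be inspected separately. The weights are random in this sketch; a real model would learn them.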

Mohammadreza Zolfaghari, Gabriel L. Oliveira, Nima Sedaghat, Thomas Brox • 2017

Related benchmarks

Task	Dataset	Result
Action Recognition	NTU RGB+D (Cross-subject)	Accuracy 80.8	511
Action Recognition	UCF101 (mean of 3 splits)	Accuracy 91.1	357
Action Classification	HMDB51 (over all three splits)	Accuracy 69.7	121
Action Recognition	NTU RGB+D v1 (Cross-Subject (CS))	Accuracy 80.8	50
Action Recognition	JHMDB Mean over 3 splits	Accuracy 56.8	18
Action Classification	J-HMDB (averaged over 3 splits)	Accuracy 76.1	14
Action Recognition	JHMDB	Mean Per-Class Accuracy 76.1	11
Spatial action detection	J-HMDB	Video mAP (IoU=0.5) 73.47	5
Spatio-temporal action detection	UCF101 (split1)	mAP (IoU=0.05) 65.22	5
Action Recognition	JHMDB (1)	Accuracy 45.5	2

