
Chained Multi-stream Networks Exploiting Pose, Motion, and Appearance for Action Classification and Detection

About

General human action recognition requires understanding of various visual cues. In this paper, we propose a network architecture that computes and integrates the most important visual cues for action recognition: pose, motion, and the raw images. For the integration, we introduce a Markov chain model which adds cues successively. The resulting approach is efficient and applicable to action classification as well as to spatial and temporal action localization. The two contributions clearly improve the performance over respective baselines. The overall approach achieves state-of-the-art action classification performance on HMDB51, J-HMDB and NTU RGB+D datasets. Moreover, it yields state-of-the-art spatio-temporal action localization results on UCF101 and J-HMDB.
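To make "adds cues successively" concrete, here is a minimal sketch of a chained fusion step in plain Python. It is an illustration under stated assumptions, not the authors' implementation: the stage order (pose, then motion, then raw images) simply follows the order in which the abstract lists the cues, each stage is reduced to a single linear layer with random weights, only the previous stage's class posterior is passed along the chain, and all names (`make_stage`, `chained_predict`, the feature sizes) are hypothetical.

```python
import math
import random


def softmax(scores):
    """Turn raw class scores into a probability distribution."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def linear(weights, bias, x):
    """Apply one linear layer: one output score per class."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def make_stage(in_dim, num_classes, rng):
    """Hypothetical stage: random weights stand in for a trained stream."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)]
               for _ in range(num_classes)]
    bias = [0.0] * num_classes
    return weights, bias


def chained_predict(stream_features, stages, num_classes):
    """Refine the class posterior one cue at a time.

    Each stage sees its own stream features concatenated with the
    posterior produced by the previous stage (assumed chaining scheme).
    """
    prev = [1.0 / num_classes] * num_classes  # uniform prior before any cue
    per_stage = []
    for feats, (weights, bias) in zip(stream_features, stages):
        x = list(feats) + prev
        prev = softmax(linear(weights, bias, x))
        per_stage.append(prev)
    return per_stage


rng = random.Random(0)
num_classes = 3
# Assumed feature sizes per stream, in chain order.
dims = {"pose": 4, "motion": 5, "rgb": 6}
features = [[rng.random() for _ in range(d)] for d in dims.values()]
stages = [make_stage(d + num_classes, num_classes, rng) for d in dims.values()]
outputs = chained_predict(features, stages, num_classes)
for name, probs in zip(dims, outputs):
    print(name, [round(p, 3) for p in probs])
```

The last entry of `outputs` is the prediction after all three cues have been added. In a trained system each stage would be a full network stream rather than a single linear layer.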

Mohammadreza Zolfaghari, Gabriel L. Oliveira, Nima Sedaghat, Thomas Brox • 2017

Related benchmarks

Task                              Dataset                            Metric                   Result  Rank
Action Recognition                NTU RGB+D (Cross-subject)          Accuracy                 80.8    474
Action Recognition                UCF101 (mean of 3 splits)          Accuracy                 91.1    357
Action Classification             HMDB51 (over all three splits)     Accuracy                 69.7    121
Action Recognition                NTU RGB+D v1 (Cross-Subject (CS))  Accuracy                 80.8    50
Action Recognition                JHMDB Mean over 3 splits           Accuracy                 56.8    18
Action Classification             J-HMDB (averaged over 3 splits)    Accuracy                 76.1    14
Action Recognition                JHMDB                              Mean Per-Class Accuracy  76.1    11
Spatial action detection          J-HMDB                             Video mAP (IoU=0.5)      73.47   5
Spatio-temporal action detection  UCF101 (split1)                    mAP (IoU=0.05)           65.22   5
Action Recognition                JHMDB (1)                          Accuracy                 45.5    2
Showing 10 of 11 rows
