VPN: Learning Video-Pose Embedding for Activities of Daily Living

About

In this paper, we focus on the spatio-temporal aspect of recognizing Activities of Daily Living (ADL). ADL have two specific properties: (i) subtle spatio-temporal patterns and (ii) similar visual patterns that vary with time. As a result, different ADL may look very similar, and distinguishing them often requires looking at fine-grained details. Because recent spatio-temporal 3D ConvNets are too rigid to capture the subtle visual patterns across an action, we propose a novel Video-Pose Network: VPN. The two key components of VPN are a spatial embedding and an attention network. The spatial embedding projects the 3D poses and RGB cues into a common semantic space. This enables the action recognition framework to learn better spatio-temporal features by exploiting both modalities. To discriminate between similar actions, the attention network provides two functionalities: (i) an end-to-end learnable pose backbone that exploits the topology of the human body, and (ii) a coupler that provides joint spatio-temporal attention weights across a video. Experiments show that VPN outperforms state-of-the-art results for action classification on a large-scale human activity dataset, NTU-RGB+D 120; its subset, NTU-RGB+D 60; a challenging real-world human activity dataset, Toyota Smarthome; and a small-scale human-object interaction dataset, Northwestern UCLA.
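
The idea of joint spatio-temporal attention weights can be illustrated with a small, generic sketch. This is not the paper's implementation: the abstract does not specify how the coupler combines spatial and temporal attention, so the outer product of two softmax distributions used here is an assumption, and the function names (`joint_attention`, `attend`) and the toy scores are hypothetical. In a real model the scores would come from the pose backbone and the features from an RGB video backbone.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def joint_attention(spatial_scores, temporal_scores):
    """Combine per-location and per-frame scores into one weight per
    (frame, location) cell. ASSUMPTION: outer product of two softmaxes;
    the paper's coupler may differ."""
    a_s = softmax(spatial_scores)   # sums to 1 over spatial locations
    a_t = softmax(temporal_scores)  # sums to 1 over frames
    return [[w_t * w_s for w_s in a_s] for w_t in a_t]


def attend(features, weights):
    """Attention-weighted sum of features[t][s] (a list of channels)
    into a single feature vector."""
    channels = len(features[0][0])
    pooled = [0.0] * channels
    for t, frame in enumerate(features):
        for s, vec in enumerate(frame):
            for c in range(channels):
                pooled[c] += weights[t][s] * vec[c]
    return pooled


# Toy input: 2 frames, 2 spatial locations, 1 channel (hypothetical values).
features = [[[1.0], [2.0]], [[3.0], [4.0]]]
# Equal scores give uniform attention, so the result is the plain mean.
weights = joint_attention([0.0, 0.0], [0.0, 0.0])
pooled = attend(features, weights)
```

With unequal scores, cells whose frame and location both score highly dominate the pooled vector, which is the intuition behind attending to the fine-grained part of an action.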

Srijan Das, Saurav Sharma, Rui Dai, Francois Bremond, Monique Thonnat • 2020

Related benchmarks

Task	Dataset	Result
Action Recognition	NTU RGB+D 120 (X-set)	Accuracy 87.8	779
Action Recognition	NTU RGB+D 60 (Cross-View)	Accuracy 98	601
Action Recognition	NTU RGB+D 60 (X-sub)	Accuracy 95.5	496
Action Recognition	NTU RGB+D X-sub 120	Accuracy 86.3	482
Action Recognition	NTU RGB-D Cross-Subject 60	Accuracy 95.5	358
Action Recognition	NTU RGB+D 120 Cross-Subject	Accuracy 87.8	249
Action Recognition	NTU-120 (cross-subject (xsub))	Accuracy 86.3	239
Action Recognition	NTU 120 (Cross-Setup)	Accuracy 87.8	231
Action Recognition	NTU RGB+D X-View 60	Accuracy 98	218
Action Recognition	NW-UCLA	Top-1 Acc 93.5	128

Showing 10 of 27 rows
