HaLP: Hallucinating Latent Positives for Skeleton-based Self-Supervised Learning of Actions

About

Supervised learning of skeleton sequence encoders for action recognition has received significant attention in recent times. However, learning such encoders without labels remains a challenging problem. While prior works have shown promising results by applying contrastive learning to pose sequences, the quality of the learned representations is often closely tied to the data augmentations used to craft the positives. Augmenting pose sequences is difficult, because the geometric constraints among the skeleton joints must be enforced to keep the augmentations realistic for that action. In this work, we propose a new contrastive learning approach to train models for skeleton-based action recognition without labels. Our key contribution is a simple module, HaLP, to Hallucinate Latent Positives for contrastive learning. Specifically, HaLP explores the latent space of poses in suitable directions to generate new positives. To this end, we present a novel optimization formulation to solve for the synthetic positives with explicit control over their hardness. We propose approximations to the objective, making it solvable in closed form with minimal overhead. We show via experiments that using these generated positives within a standard contrastive learning framework leads to consistent improvements across benchmarks such as NTU-60, NTU-120, and PKU-II on tasks like linear evaluation, transfer learning, and kNN evaluation. Our code will be made available at https://github.com/anshulbshah/HaLP.
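
To make the idea of generating a positive in latent space with controlled hardness more concrete, here is a minimal sketch. It is not the optimization formulation from the paper, which the abstract does not spell out. It assumes L2-normalized embeddings and a hypothetical target direction called `prototype`, and it moves the anchor along the great circle toward that target. The function name `hallucinate_positive` and the `step` parameter are illustrative only.

```python
import numpy as np


def l2_normalize(v):
    """Scale a vector to unit length."""
    return v / np.linalg.norm(v)


def hallucinate_positive(anchor, prototype, step):
    """Spherical interpolation from `anchor` toward `prototype`.

    Assumption for illustration only: `step` in [0, 1] acts as a hardness
    knob. A larger step gives a synthetic positive with lower cosine
    similarity to the anchor, i.e. a harder positive.
    """
    a = l2_normalize(anchor)
    p = l2_normalize(prototype)
    theta = np.arccos(np.clip(np.dot(a, p), -1.0, 1.0))
    sin_theta = np.sin(theta)
    if sin_theta < 1e-8:
        # Anchor and prototype are (anti)parallel: no unique direction to move.
        return a.copy()
    out = (np.sin((1.0 - step) * theta) * a + np.sin(step * theta) * p) / sin_theta
    return l2_normalize(out)


anchor = np.array([1.0, 0.0, 0.0])
prototype = np.array([0.0, 1.0, 0.0])
easy = hallucinate_positive(anchor, prototype, 0.2)
hard = hallucinate_positive(anchor, prototype, 0.8)
print(np.dot(anchor, easy), np.dot(anchor, hard))
```

In this sketch the cosine similarity between the anchor and the synthetic positive equals cos(step × θ), where θ is the angle between anchor and prototype, so the hardness is set directly by `step`. Such a vector could then be used as an extra positive in a standard contrastive loss.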

Anshul Shah, Aniket Roy, Ketul Shah, Shlok Kumar Mishra, David Jacobs, Anoop Cherian, Rama Chellappa • 2023

Related benchmarks

Task	Dataset	Result
Action Recognition	NTU RGB+D 120 (X-set)	Accuracy 72.2	779
Action Recognition	NTU RGB+D 60 (X-sub)	--	496
Action Recognition	NTU RGB+D X-sub 120	Accuracy 71.1	482
Action Recognition	NTU-60 (xsub)	Accuracy 79.7	271
Action Recognition	NTU-120 (cross-subject (xsub))	Accuracy 71.1	239
Skeleton-based Action Recognition	NTU 60 (X-sub)	Accuracy 79.7	227
Action Recognition	NTU RGB+D X-View 60	Accuracy 86.8	218
Skeleton-based Action Recognition	NTU RGB+D 120 (X-set)	Top-1 Accuracy 72.2	184
Skeleton-based Action Recognition	NTU 120 (X-sub)	Accuracy 71.1	153
Skeleton-based Action Recognition	NTU RGB+D 60 (X-View)	Top-1 Accuracy 86.8	126

Showing 10 of 29 rows
