Variational Contrastive Learning for Skeleton-based Action Recognition

About

In recent years, self-supervised representation learning for skeleton-based action recognition has advanced with the development of contrastive learning methods. However, most contrastive paradigms are inherently discriminative and often struggle to capture the variability and uncertainty intrinsic to human motion. To address this issue, we propose a variational contrastive learning framework that integrates probabilistic latent modeling with contrastive self-supervised learning. This formulation enables the learning of structured and semantically meaningful representations that generalize across different datasets and supervision levels. Extensive experiments on three widely used skeleton-based action recognition benchmarks show that our proposed method consistently outperforms existing approaches, particularly in low-label regimes. Moreover, qualitative analyses show that the features learned by our method are more relevant to the motion and sample characteristics, and focus more on important skeleton joints, than those of other methods.
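The abstract does not spell out the training objective, so the following is only a generic sketch of how a probabilistic latent model can be combined with a contrastive loss, not the authors' formulation. It assumes a diagonal-Gaussian latent per sample, a standard-normal prior, an InfoNCE term between two augmented views, and a weighting factor `beta`; all function names, parameters, and default values here are hypothetical.

```python
import math
import random


def reparameterize(mu, logvar, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, 1), element-wise."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, logvar)]


def kl_to_standard_normal(mu, logvar):
    """KL( N(mu, diag(exp(logvar))) || N(0, I) ) for one sample."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv for m, lv in zip(mu, logvar))


def _normalize(v):
    """Scale a vector to unit L2 norm (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def info_nce(batch_a, batch_b, temperature=0.1):
    """InfoNCE loss: row i of batch_a and row i of batch_b are the positive pair."""
    a = [_normalize(v) for v in batch_a]
    b = [_normalize(v) for v in batch_b]
    total = 0.0
    for i, anchor in enumerate(a):
        logits = [sum(x * y for x, y in zip(anchor, cand)) / temperature for cand in b]
        peak = max(logits)  # subtract the max for a stable log-sum-exp
        log_denom = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_denom - logits[i]
    return total / len(a)


def variational_contrastive_loss(params_a, params_b, rng, beta=1e-3, temperature=0.1):
    """Assumed combined objective: InfoNCE on sampled latents plus beta * mean KL.

    params_a / params_b: lists of (mu, logvar) pairs, one per sample, for two
    augmented views of the same batch (hypothetical encoder outputs).
    """
    z_a = [reparameterize(mu, lv, rng) for mu, lv in params_a]
    z_b = [reparameterize(mu, lv, rng) for mu, lv in params_b]
    kl = sum(kl_to_standard_normal(mu, lv) for mu, lv in params_a + params_b)
    kl /= len(params_a) + len(params_b)
    return info_nce(z_a, z_b, temperature) + beta * kl
```

In this sketch the KL term regularizes each latent distribution toward the prior, while the contrastive term pulls sampled latents of the two views of the same sequence together; the relative weight `beta` is an illustrative choice rather than a value taken from the paper.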

Dang Dinh Nguyen, Decky Aspandi Latif, Titus Zaharia • 2026

Related benchmarks

Task	Dataset	Result
Action Recognition	NTU-60 (xsub)	Accuracy 86.6	271
Action Recognition	NTU-120 (cross-subject (xsub))	Accuracy 79.8	239
Action Recognition	NTU 120 (Cross-Setup)	Accuracy 81.4	231
Skeleton-based Action Recognition	NTU 60 (X-sub)	Accuracy 75.2	227
Action Recognition	NTU RGB+D X-View 60	Accuracy 92.9	218
Action Recognition	NTU-60 (xview)	Accuracy 80.2	165
Skeleton-based Action Recognition	NTU 60 (X-view)	Accuracy 80.2	125
Action Recognition	PKU-MMD (Part II)	Accuracy 39.2	90
Action Recognition	PKU-MMD Part I	Accuracy 86.1	74
Skeleton-based Action Recognition	NTU RGB+D 60 (Cross-Subject)	Accuracy 75.2	59
