
Skeleton-Snippet Contrastive Learning with Multiscale Feature Fusion for Action Localization

About

The self-supervised pretraining paradigm has achieved great success in learning 3D action representations for skeleton-based action recognition using contrastive learning. However, learning effective representations for skeleton-based temporal action localization remains challenging and underexplored. Unlike video-level action recognition, detecting action boundaries requires temporally sensitive features that capture subtle differences between adjacent frames where labels change. To this end, we formulate a snippet discrimination pretext task for self-supervised pretraining, which densely projects skeleton sequences into non-overlapping segments and promotes features that distinguish them across videos via contrastive learning. Additionally, we build on strong backbones of skeleton-based action recognition models by fusing intermediate features with a U-shaped module to enhance feature resolution for frame-level localization. Our approach consistently improves existing skeleton-based contrastive learning methods for action localization on BABEL across diverse subsets and evaluation protocols. We also achieve state-of-the-art transfer learning performance on PKUMMD with pretraining on NTU RGB+D and BABEL.
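The snippet discrimination idea above can be sketched as follows: split each skeleton feature sequence into non-overlapping snippets, pool each snippet into one vector, and apply an InfoNCE-style contrastive loss where a snippet's positive is the same-index snippet from a second augmented view, and all other snippets act as negatives. This is a minimal numpy sketch under those assumptions; the function names, mean pooling, and temperature value are illustrative, not the authors' implementation.

```python
import numpy as np

def split_snippets(seq, snippet_len):
    """Split a (T, D) feature sequence into non-overlapping snippets of
    length `snippet_len`, mean-pooling each into a single (D,) vector.
    Trailing frames that do not fill a snippet are dropped."""
    n = seq.shape[0] // snippet_len
    return seq[: n * snippet_len].reshape(n, snippet_len, -1).mean(axis=1)

def snippet_info_nce(view_a, view_b, temperature=0.1):
    """InfoNCE over snippet features: row i of `view_a` is positive with
    row i of `view_b`; every other row is a negative. Both inputs are
    (N, D) arrays of pooled snippet features from two augmented views."""
    a = view_a / np.linalg.norm(view_a, axis=1, keepdims=True)
    b = view_b / np.linalg.norm(view_b, axis=1, keepdims=True)
    logits = a @ b.T / temperature               # (N, N) cosine similarities
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))           # positives lie on the diagonal
```

In practice the pooled snippets from many videos in a batch would be concatenated before the loss, so negatives come both from within a sequence and across videos, which is what pushes the features to be temporally discriminative.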

Qiushuo Cheng, Jingjing Liu, Catherine Morgan, Alan Whone, Majid Mirmehdi • 2025

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Action Localization | PKUMMD (test) | mAP@0.5 | 86 | 13 |
| Action Localization | BABEL (test) | mAP@tIoU (Subset-1) | 53.9 | 12 |
| Action Localization | BABEL Subset-2 v1.0 (test) | mAP@0.1 | 65.2 | 12 |
| Action Localization | BABEL Subset-3 v1.0 (test) | mAP@0.1 | 42 | 6 |
| Action Localization | BABEL Subset-1 | mAP@0.1 | 60.5 | 6 |
| Action Localization | BABEL Subset-3 | mAP@0.1 | 48.5 | 6 |
| Action Localization | BABEL Subset-1 v1.0 (test) | mAP@0.1 | 52.6 | 6 |
