Less is More: Decoder-Free Masked Modeling for Efficient Skeleton Representation Learning
About
Skeleton-based action representation learning has evolved from Contrastive Learning (CL) to Masked Auto-Encoder (MAE) architectures. However, each paradigm has inherent limitations: CL often overlooks fine-grained local details, while MAE is burdened by computationally heavy decoders. Moreover, MAE suffers from severe computational asymmetry: it benefits from efficient masking during pre-training but requires exhaustive full-sequence processing for downstream tasks. To resolve these bottlenecks, we propose SLiM (Skeleton Less is More), a novel unified framework that harmonizes masked modeling with contrastive learning via a shared encoder. By eschewing the reconstruction decoder, SLiM not only eliminates computational redundancy but also compels the encoder to capture discriminative features directly. SLiM is the first framework to perform decoder-free masked modeling for representation learning. Crucially, to prevent trivial reconstruction arising from high skeletal-temporal correlation, we introduce semantic tube masking, alongside skeletal-aware augmentations designed to ensure anatomical consistency across diverse temporal granularities. Extensive experiments demonstrate that SLiM consistently achieves state-of-the-art performance across all downstream protocols. Notably, our method delivers this accuracy efficiently, reducing inference computational cost by 7.89x compared to existing MAE methods.
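The idea of tube masking mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify how semantic tube masking is defined, so the body-part grouping `BODY_PARTS`, the function `tube_mask`, and all of its parameters are hypothetical names chosen for illustration. The sketch only shows the general principle that a mask fixed across all frames stops a model from recovering a masked joint by copying it from an adjacent frame.

```python
import numpy as np

# Hypothetical grouping of a 25-joint skeleton into body parts
# (illustrative assumption; the abstract does not define the groups).
BODY_PARTS = {
    "torso": [0, 1, 2, 3, 20],
    "left_arm": [4, 5, 6, 7, 21, 22],
    "right_arm": [8, 9, 10, 11, 23, 24],
    "left_leg": [12, 13, 14, 15],
    "right_leg": [16, 17, 18, 19],
}


def tube_mask(num_frames, num_joints, parts, num_masked_parts, rng):
    """Return a boolean mask of shape (num_frames, num_joints); True = masked."""
    names = sorted(parts)
    # Pick whole body parts to hide, without replacement.
    chosen = rng.choice(len(names), size=num_masked_parts, replace=False)
    joint_mask = np.zeros(num_joints, dtype=bool)
    for idx in chosen:
        joint_mask[parts[names[idx]]] = True
    # "Tube": repeat the same joint mask for every frame, so a masked joint
    # is hidden for the whole sequence rather than in isolated frames.
    return np.tile(joint_mask, (num_frames, 1))


rng = np.random.default_rng(0)
mask = tube_mask(num_frames=64, num_joints=25, parts=BODY_PARTS,
                 num_masked_parts=2, rng=rng)
visible = ~mask  # an encoder would only see these positions
```

Masking whole parts rather than independent joints is one plausible reading of "semantic"; the actual grouping and masking ratio used by SLiM may differ.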
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Action Recognition | NTU RGB+D 120 (X-set) | Accuracy | 83.6 | 717 |
| Action Recognition | NTU RGB+D 60 (X-sub) | -- | -- | 467 |
| Action Recognition | NTU RGB+D X-sub 120 | Accuracy | 81.2 | 430 |
| Action Recognition | NTU-60 (xsub) | Accuracy | 87.9 | 223 |
| Action Recognition | NTU RGB+D X-View 60 | Accuracy | 93.2 | 190 |
| Action Recognition | NTU-60 (xview) | Accuracy | 93.2 | 117 |
| Action Recognition | PKU-MMD (Part II) | -- | -- | 71 |
| Action Recognition | PKU-MMD (XSub) | Top-1 Acc | 59.7 | 43 |
| Action Recognition | NTU 60 (X-sub) | Accuracy (10% data) | 88.8 | 35 |
| Action Retrieval | NTU 60 (X-view) | Accuracy | 89.9 | 28 |