SMILE: Infusing Spatial and Motion Semantics in Masked Video Learning

About

Masked video modeling, as in VideoMAE, is an effective paradigm for video self-supervised learning (SSL). However, such methods primarily reconstruct pixel-level details of natural videos, which have substantial temporal redundancy; this limits their capacity for semantic representation and for adequately encoding motion dynamics. To address these issues, this paper introduces a novel SSL approach for video representation learning, dubbed SMILE, which infuses both spatial and motion semantics. In SMILE, we leverage image-language pretrained models, such as CLIP, to guide the learning process with their high-level spatial semantics. We enhance the representation of motion by introducing synthetic motion patterns into the training data, allowing the model to capture more complex and dynamic content. Furthermore, using SMILE, we establish a new self-supervised video learning paradigm capable of learning strong video representations without requiring any natural video data. We have carried out extensive experiments on 7 datasets with various downstream scenarios. SMILE surpasses current state-of-the-art SSL methods, showcasing its effectiveness in learning more discriminative and generalizable video representations. Code is available at https://github.com/fmthoker/SMILE

Fida Mohammad Thoker, Letian Jiang, Chen Zhao, Bernard Ghanem • 2025
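
The idea of adding synthetic motion to training data can be illustrated with a minimal sketch. This is not the authors' implementation (see the linked repository for that); it assumes a simple linear trajectory, a plain white square as the moving object, and a static image repeated over time as the base clip. The function name `overlay_synthetic_motion` and all parameters are hypothetical.

```python
import numpy as np

def overlay_synthetic_motion(frames, patch, start, velocity):
    """Paste `patch` onto every frame along a straight-line trajectory.

    frames:   (T, H, W, C) array holding the base clip.
    patch:    (h, w, C) array holding the object to move.
    start:    (y, x) top-left corner of the patch in frame 0.
    velocity: (dy, dx) displacement in pixels per frame.
    """
    out = frames.copy()  # leave the input clip untouched
    T, H, W, _ = frames.shape
    h, w, _ = patch.shape
    for t in range(T):
        # Clamp the position so the patch stays fully inside the frame.
        y = int(np.clip(start[0] + t * velocity[0], 0, H - h))
        x = int(np.clip(start[1] + t * velocity[1], 0, W - w))
        out[t, y:y + h, x:x + w] = patch
    return out

rng = np.random.default_rng(0)
# Base clip: one random image repeated over 8 frames, so it has no motion of its own.
image = rng.integers(0, 256, size=(32, 32, 3), dtype=np.uint8)
frames = np.repeat(image[None], 8, axis=0)
# Hypothetical moving object: a 4x4 white square.
patch = np.full((4, 4, 3), 255, dtype=np.uint8)
video = overlay_synthetic_motion(frames, patch, start=(2, 2), velocity=(3, 2))
```

Because the base clip here is a single still image, all temporal change in `video` comes from the pasted object, which loosely mirrors the abstract's point that motion cues need not come from natural video.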

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Action Recognition | Something-Something v2 (val) | Top-1 Accuracy 72.5 | 535 |
| Action Recognition | Kinetics-400 | Top-1 Acc 56.2 | 413 |
| Action Recognition | Something-Something v2 | Top-1 Accuracy 23.7 | 341 |
| Action Recognition | Kinetics 400 (test) | Top-1 Accuracy 83.1 | 245 |
| Action Recognition | HMDB51 | Top-1 Acc 53.4 | 225 |
| Video Action Recognition | Kinetics-400 | Top-1 Acc 83.1 | 184 |
| Video Action Recognition | UCF101 | Top-1 Acc 96.4 | 153 |
| Action Recognition | UCF-101 | Top-1 Acc 83.8 | 147 |
| Video Action Classification | Something-Something v2 | Top-1 Acc 71.9 | 139 |
| Video Classification | Kinetics-400 | Top-1 Acc 83.4 | 131 |
Showing 10 of 23 rows
