Learnable Motion-Focused Tokenization for Effective and Efficient Video Unsupervised Domain Adaptation
About
Video Unsupervised Domain Adaptation (VUDA) poses a significant challenge in action recognition, requiring the adaptation of a model from a labeled source domain to an unlabeled target domain. Despite recent advances, existing VUDA methods often fall short of fully supervised performance, a key reason being the prevalence of static and uninformative backgrounds that exacerbate domain shifts. Additionally, prior approaches largely overlook computational efficiency, limiting real-world adoption. To address these issues, we propose Learnable Motion-Focused Tokenization (LMFT) for VUDA. LMFT tokenizes video frames into patch tokens and learns to discard low-motion, redundant tokens, primarily corresponding to background regions, while retaining motion-rich, action-relevant tokens for adaptation. Extensive experiments on three standard VUDA benchmarks across 21 domain adaptation settings show that our VUDA framework with LMFT achieves state-of-the-art performance while significantly reducing computational overhead. LMFT thus enables VUDA that is both effective and computationally efficient.
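The abstract describes the core mechanism: frames are split into patch tokens, and a learned criterion discards low-motion (background) tokens while keeping motion-rich ones. The paper's exact formulation is not reproduced here; below is a minimal PyTorch sketch of that idea, assuming a ViT-style patch embedding, a frame-difference motion cue, a small learnable scorer, and hard top-k selection. All names (`MotionFocusedTokenizer`, `keep_ratio`, `scorer`) are illustrative, not from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MotionFocusedTokenizer(nn.Module):
    """Sketch: patchify frames, score each token by motion, keep only the top-k tokens."""

    def __init__(self, patch_size: int = 16, embed_dim: int = 256, keep_ratio: float = 0.5):
        super().__init__()
        self.patch_size = patch_size
        self.keep_ratio = keep_ratio
        # ViT-style linear patch embedding (assumed for illustration).
        self.proj = nn.Conv2d(3, embed_dim, kernel_size=patch_size, stride=patch_size)
        # Learnable scorer over token content concatenated with a scalar motion cue.
        self.scorer = nn.Linear(embed_dim + 1, 1)

    def forward(self, video: torch.Tensor):
        # video: (B, T, 3, H, W)
        B, T, C, H, W = video.shape
        frames = video.reshape(B * T, C, H, W)
        tokens = self.proj(frames)                          # (B*T, D, H/p, W/p)
        D = tokens.shape[1]
        tokens = tokens.flatten(2).transpose(1, 2)          # (B*T, N, D)

        # Motion cue: absolute frame difference, averaged over channels and pooled per patch.
        diff = (video[:, 1:] - video[:, :-1]).abs().mean(dim=2, keepdim=True)  # (B, T-1, 1, H, W)
        diff = torch.cat([diff[:, :1], diff], dim=1)        # repeat first step to cover frame 0
        motion = F.avg_pool2d(diff.reshape(B * T, 1, H, W), self.patch_size)   # (B*T, 1, H/p, W/p)
        motion = motion.flatten(2).transpose(1, 2)          # (B*T, N, 1)

        # Learned per-token score; retain the highest-scoring (motion-rich) tokens.
        scores = self.scorer(torch.cat([tokens, motion], dim=-1)).squeeze(-1)  # (B*T, N)
        k = max(1, int(self.keep_ratio * scores.shape[1]))
        keep_idx = scores.topk(k, dim=1).indices                               # (B*T, k)
        kept = tokens.gather(1, keep_idx.unsqueeze(-1).expand(-1, -1, D))      # (B*T, k, D)
        return kept.reshape(B, T, k, D), keep_idx.reshape(B, T, k)


if __name__ == "__main__":
    # Usage: 8-frame clips at 224x224; about half the tokens survive per frame.
    clip = torch.randn(2, 8, 3, 224, 224)
    kept_tokens, kept_idx = MotionFocusedTokenizer()(clip)
    print(kept_tokens.shape)  # torch.Size([2, 8, 98, 256])
```

Note that a hard top-k is not differentiable through the selection itself, so a real implementation would likely train the scorer with a differentiable relaxation (e.g., Gumbel-Softmax) or an auxiliary motion-supervision loss; the sketch only illustrates how dropping background tokens shrinks the token set passed to the adaptation backbone.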
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Unsupervised Domain Adaptation | UCF-HMDB | Accuracy (U → H) | 94.2 | 24 |
| Video Unsupervised Domain Adaptation | Daily-DA (test) | Accuracy (H → A) | 47.8 | 13 |
| Video Unsupervised Domain Adaptation | ActorShift | Transfer Score (KT to C1) | 88.9 | 7 |