D$^2$ST-Adapter: Disentangled-and-Deformable Spatio-Temporal Adapter for Few-shot Action Recognition

About

Adapting pre-trained image models to video modality has proven to be an effective strategy for robust few-shot action recognition. In this work, we explore the potential of adapter tuning in image-to-video model adaptation and propose a novel video adapter tuning framework, called Disentangled-and-Deformable Spatio-Temporal Adapter (D$^2$ST-Adapter). It features a lightweight design, low adaptation overhead and powerful spatio-temporal feature adaptation capabilities. D$^2$ST-Adapter is structured with an internal dual-pathway architecture that enables built-in disentangled encoding of spatial and temporal features within the adapter, seamlessly integrating into the single-stream feature learning framework of pre-trained image models. In particular, we develop an efficient yet effective implementation of the D$^2$ST-Adapter, incorporating the specially devised anisotropic Deformable Spatio-Temporal Attention as its pivotal operation. This mechanism can be individually tailored for two pathways with anisotropic sampling densities along the spatial and temporal domains in 3D spatio-temporal space, enabling disentangled encoding of spatial and temporal features while maintaining a lightweight design. Extensive experiments by instantiating our method on both pre-trained ResNet and ViT demonstrate the superiority of our method over state-of-the-art methods. Our method is particularly well-suited to challenging scenarios where temporal dynamics are critical for action recognition. Code is available at https://github.com/qizhongtan/D2ST-Adapter.
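The idea of anisotropic sampling densities can be pictured with a small sketch. This is not the authors' implementation (see the linked repository for that); it only illustrates, under assumed numbers, how two pathways might place reference points with different densities along the temporal and spatial axes of a 3D feature volume. The function name `uniform_reference_points`, the volume size, and the per-axis counts are all hypothetical, and the learned offsets that make the attention deformable are omitted entirely.

```python
from itertools import product


def uniform_reference_points(shape, counts):
    """Return evenly spaced (t, h, w) reference points inside a feature volume.

    shape  -- (T, H, W) size of the feature volume
    counts -- (n_t, n_h, n_w) number of sample points along each axis
    """
    axes = []
    for size, n in zip(shape, counts):
        step = size / n
        # Place each point at the centre of its cell along this axis.
        axes.append([(i + 0.5) * step for i in range(n)])
    return list(product(*axes))


# Hypothetical feature volume: 8 frames of 14x14 tokens.
volume = (8, 14, 14)

# Assumed densities, chosen only for illustration.
# Spatial pathway: dense in space, sparse in time.
spatial_pathway = uniform_reference_points(volume, (2, 7, 7))
# Temporal pathway: dense in time, sparse in space.
temporal_pathway = uniform_reference_points(volume, (8, 2, 2))

print(len(spatial_pathway), len(temporal_pathway))
```

In a deformable attention layer, each reference point would typically be shifted by a predicted offset before features are sampled there; the sketch stops at the fixed grid so that the difference in density between the two pathways is easy to see.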

Wenjie Pei, Qizhong Tan, Guangming Lu, Jiandong Tian, Jun Yu • 2023

Related benchmarks

Task                         Dataset      Metric              Result  Rank
Action Recognition           Kinetics     Accuracy (5-shot)   95.5    47
Action Recognition           SSv2 Small   Top-1 Acc (1-shot)  55      26
Action Recognition           SS Full v2   1-shot Accuracy     66.7    21
Action Recognition           HMDB51       Accuracy (1-shot)   77.1    16
Action Recognition           UCF101       1-shot Accuracy     96.4    16
Few-shot Action Recognition  HMDB51       Accuracy            77.1    4
Few-shot Action Recognition  UCF101       Accuracy            96.4    4
