Side4Video: Spatial-Temporal Side Network for Memory-Efficient Image-to-Video Transfer Learning

About

Large pre-trained vision models achieve impressive success in computer vision. However, fully fine-tuning large models for downstream tasks, particularly in video understanding, can be prohibitively computationally expensive. Recent studies have turned their focus to efficient image-to-video transfer learning. Nevertheless, existing efficient fine-tuning methods pay little attention to training memory usage and do not explore transferring larger models to the video domain. In this paper, we present a novel Spatial-Temporal Side Network, named Side4Video, for memory-efficient fine-tuning of large image models for video understanding. Specifically, we introduce a lightweight spatial-temporal side network attached to the frozen vision model, which avoids backpropagation through the heavy pre-trained model and utilizes multi-level spatial features from the original image model. This extremely memory-efficient architecture enables our method to reduce memory usage by 75% compared with previous adapter-based methods. In this way, we can transfer a huge ViT-E (4.4B parameters) to video understanding tasks, a model 14x larger than ViT-L (304M parameters). Our approach achieves remarkable performance on various video datasets across unimodal and cross-modal tasks (i.e., action recognition and text-video retrieval), especially on Something-Something V1&V2 (67.3% & 74.6%), Kinetics-400 (88.6%), MSR-VTT (52.3%), MSVD (56.1%) and VATEX (68.8%). We release our code at https://github.com/HJYao00/Side4Video.
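To make the memory argument concrete, here is a toy scalar sketch of the side-network idea, not the Side4Video architecture itself. Everything in it is an assumption made for illustration: the names `BACKBONE_WEIGHTS`, `backbone_features`, `side_forward`, `side_gradients` and `train_step`, the tanh layers, and the fusion rule `s_l = a_l * s_{l-1} + b_l * h_l`. It only shows that when a trainable side branch consumes per-layer features from a frozen backbone, the backward pass can treat those features as constants and never needs to traverse the backbone.

```python
import math

# Frozen "backbone": a stack of scalar layers h_l = tanh(w_l * h_{l-1}).
# These weights are hypothetical stand-ins for a pre-trained image model.
BACKBONE_WEIGHTS = [0.9, -1.1, 0.7]


def backbone_features(x):
    """Run the frozen backbone and return the output of every layer."""
    feats, h = [], x
    for w in BACKBONE_WEIGHTS:
        h = math.tanh(w * h)
        feats.append(h)
    return feats


def side_forward(feats, a, b):
    """Toy side network: s_l = a_l * s_{l-1} + b_l * h_l, with s_0 = 0."""
    states = [0.0]
    for a_l, b_l, h_l in zip(a, b, feats):
        states.append(a_l * states[-1] + b_l * h_l)
    return states


def side_gradients(feats, states, a, target):
    """Gradients of (s_L - target)^2 with respect to the side parameters.

    The backbone features enter as constants, so this backward pass never
    touches BACKBONE_WEIGHTS or the tanh layers.
    """
    grad_a, grad_b = [0.0] * len(a), [0.0] * len(a)
    d_state = 2.0 * (states[-1] - target)   # dLoss / d s_L
    for l in reversed(range(len(a))):
        grad_a[l] = d_state * states[l]     # d s_{l+1} / d a_l = s_l
        grad_b[l] = d_state * feats[l]      # d s_{l+1} / d b_l = h_l
        d_state *= a[l]                     # propagate to s_l
    return grad_a, grad_b


def train_step(x, target, a, b, lr=0.1):
    """One gradient-descent step that updates only the side parameters."""
    feats = backbone_features(x)
    states = side_forward(feats, a, b)
    grad_a, grad_b = side_gradients(feats, states, a, target)
    new_a = [p - lr * g for p, g in zip(a, grad_a)]
    new_b = [p - lr * g for p, g in zip(b, grad_b)]
    loss = (states[-1] - target) ** 2
    return new_a, new_b, loss
```

In this sketch the only values kept for the backward pass are the side states and the backbone's layer outputs. An adapter inserted inside the backbone would instead require gradients to flow back through every frozen layer, which is the cost the abstract says the side network avoids.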

Huanjin Yao, Wenhao Wu, Zhiheng Li • 2023

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Action Recognition | Something-Something v2 (val) | Top-1 Accuracy | 75.2 | 535 |
| Action Recognition | Kinetics-400 | Top-1 Acc | 88.6 | 413 |
| Action Recognition | Something-something v1 (val) | Top-1 Acc | 67.3 | 257 |
| Text-to-Video Retrieval | MSVD (test) | R@1 | 56.1 | 204 |
| Video Action Recognition | Kinetics 400 (val) | Top-1 Acc | 84.2 | 151 |
| Text-to-Video Retrieval | VATEX (test) | R@1 | 68.8 | 62 |
| Video-to-Text retrieval | MSVD (test) | R@1 | 71.7 | 61 |
| Text-to-Video Retrieval | MSR-VTT 1K (test) | R@1 | 52.3 | 45 |
| Video-to-Text retrieval | MSR-VTT 1K (test) | R@1 | 50.4 | 39 |
| Video-to-Text retrieval | VATEX (test) | Recall@1 | 82.3 | 15 |
