
MoTE: Reconciling Generalization with Specialization for Visual-Language to Video Knowledge Transfer

About

Transferring visual-language knowledge from large-scale foundation models to video recognition has proven effective. To bridge the domain gap, additional parametric modules are added to capture temporal information. However, zero-shot generalization diminishes as the number of specialized parameters increases, forcing existing works to trade off zero-shot against close-set performance. In this paper, we present MoTE, a novel framework that balances generalization and specialization in one unified model. Our approach tunes a mixture of temporal experts to learn multiple task views with various degrees of data fitting. To maximally preserve the knowledge of each expert, we propose Weight Merging Regularization, which regularizes the merging process of experts in weight space. Additionally, we apply temporal feature modulation to regularize the contribution of temporal features at test time. We achieve a sound balance between zero-shot and close-set video recognition tasks and obtain state-of-the-art or competitive results on various datasets, including Kinetics-400 and Kinetics-600, UCF, and HMDB. Code is available at https://github.com/ZMHH-H/MoTE.

Minghao Zhu, Zhengpu Wang, Mengxian Hu, Ronghao Dang, Xiao Lin, Xun Zhou, Chengju Liu, Qijun Chen • 2024
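
The abstract mentions merging experts in weight space and modulating the contribution of temporal features at test time. As a rough illustration of those two ideas only, here is a minimal sketch. It is not the authors' implementation: the function names `merge_experts` and `modulate_temporal`, the plain convex combination of weights, and the fixed `scale` factor are all assumptions made for this example, and experts are represented as flat lists of floats rather than real network parameters.

```python
def merge_experts(experts, coeffs=None):
    """Merge expert weight vectors by a convex combination in weight space.

    experts: list of equal-length lists of floats (hypothetical flattened weights).
    coeffs:  optional non-negative mixing coefficients; uniform if omitted.
    """
    n = len(experts)
    if n == 0:
        raise ValueError("need at least one expert")
    if coeffs is None:
        coeffs = [1.0] * n
    if len(coeffs) != n:
        raise ValueError("one coefficient per expert is required")
    total = sum(coeffs)
    if total <= 0:
        raise ValueError("coefficients must sum to a positive value")
    # Normalize so the merged weights stay on the same scale as the experts.
    coeffs = [c / total for c in coeffs]
    dim = len(experts[0])
    if any(len(e) != dim for e in experts):
        raise ValueError("all experts must have the same number of weights")
    return [sum(c * e[i] for c, e in zip(coeffs, experts)) for i in range(dim)]


def modulate_temporal(spatial_feat, temporal_feat, scale):
    """Add a temporal feature to a spatial feature, scaled by `scale`.

    `scale` is a hypothetical knob: 0.0 ignores the temporal feature,
    1.0 uses it at full strength.
    """
    return [s + scale * t for s, t in zip(spatial_feat, temporal_feat)]


if __name__ == "__main__":
    experts = [[1.0, 2.0], [3.0, 4.0]]
    print(merge_experts(experts))              # uniform merge
    print(merge_experts(experts, [3.0, 1.0]))  # weighted toward the first expert
    print(modulate_temporal([1.0, 1.0], [2.0, 4.0], 0.5))
```

In this toy setting, a smaller `scale` keeps the output closer to the spatial feature alone. How MoTE actually chooses the merging coefficients and the modulation strength is described in the paper and the linked repository, not here.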

Related benchmarks

Task | Dataset | Metric | Result | Rank
Video Recognition | HMDB51 | - | - | 89
Video Recognition | UCF101 | Top-1 Acc | 93.6 | 64
Video Recognition | SS v2 | Top-1 Acc | 12.2 | 47
Video Recognition | Kinetics-400 close-set | Top-1 Acc | 87.2 | 21
Video Recognition | HMDB51 (test) | Accuracy | 61.4 | 19
Zero-Shot Video Recognition | UCF, HMDB, and Kinetics-600 Zero-shot | HMDB zs Acc | 74.8 | 18
Video Recognition | UCF-101 (test) | Accuracy | 88.7 | 16
Video Recognition | Kinetics-600 (test) | Accuracy | 78.4 | 15
Few-shot video recognition | UCF-101 | Top-1 Acc (K=2) | 88.1 | 13
Few-shot video recognition | HMDB-51 | Top-1 Acc (K=2) | 0.61 | 13
