MoTE: Reconciling Generalization with Specialization for Visual-Language to Video Knowledge Transfer
About
Transferring visual-language knowledge from large-scale foundation models for video recognition has proved to be effective. To bridge the domain gap, additional parametric modules are added to capture temporal information. However, zero-shot generalization diminishes as the number of specialized parameters increases, forcing existing works to trade off between zero-shot and close-set performance. In this paper, we present MoTE, a novel framework that balances generalization and specialization within one unified model. Our approach tunes a mixture of temporal experts to learn multiple task views with varying degrees of data fitting. To maximally preserve the knowledge of each expert, we propose *Weight Merging Regularization*, which regularizes the merging process of experts in weight space. We further apply temporal feature modulation to regularize the contribution of the temporal feature at test time. We achieve a sound balance between zero-shot and close-set video recognition tasks and obtain state-of-the-art or competitive results on various datasets, including Kinetics-400 & 600, UCF, and HMDB. Code is available at [https://github.com/ZMHH-H/MoTE](https://github.com/ZMHH-H/MoTE).
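The merging of experts in weight space mentioned above can be sketched as a per-parameter weighted average of the experts' parameters. This is only a minimal illustration of weight-space merging in general; the function name, the dict-based parameter representation, and the uniform coefficients are assumptions for the sketch, not MoTE's actual implementation.

```python
# Minimal sketch: merge several "temporal expert" parameter sets into a
# single set of weights by averaging each parameter in weight space.
# Expert structure and coefficient choice are illustrative assumptions.

def merge_expert_weights(expert_state_dicts, coeffs=None):
    """Per-parameter weighted average across experts.

    expert_state_dicts: list of dicts mapping parameter name -> value.
    coeffs: optional merging coefficients; defaults to a uniform average.
    """
    n = len(expert_state_dicts)
    if coeffs is None:
        coeffs = [1.0 / n] * n  # uniform merge over all experts
    merged = {}
    for name in expert_state_dicts[0]:
        merged[name] = sum(c * sd[name]
                           for c, sd in zip(coeffs, expert_state_dicts))
    return merged

# Toy example with scalar "parameters" standing in for weight tensors.
experts = [{"w": 1.0, "b": 0.0}, {"w": 3.0, "b": 2.0}]
merged = merge_expert_weights(experts)
print(merged)  # {'w': 2.0, 'b': 1.0}
```

In practice the same averaging applies element-wise to full weight tensors (e.g. entries of a PyTorch `state_dict`), and the merging coefficients are what a regularizer such as the paper's Weight Merging Regularization would constrain.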
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Video Recognition | HMDB51 | -- | -- | 89 |
| Video Recognition | UCF101 | Top-1 Acc | 93.6 | 64 |
| Video Recognition | SS v2 | Top-1 Acc | 12.2 | 47 |
| Video Recognition | Kinetics-400 (close-set) | Top-1 Acc | 87.2 | 21 |
| Video Recognition | HMDB51 (test) | Accuracy | 61.4 | 19 |
| Zero-Shot Video Recognition | UCF, HMDB, and Kinetics-600 (zero-shot) | HMDB zero-shot Acc | 74.8 | 18 |
| Video Recognition | UCF-101 (test) | Accuracy | 88.7 | 16 |
| Video Recognition | Kinetics-600 (test) | Accuracy | 78.4 | 15 |
| Few-shot video recognition | UCF-101 | Top-1 Acc (K=2) | 88.1 | 13 |
| Few-shot video recognition | HMDB-51 | Top-1 Acc (K=2) | 0.61 | 13 |