MoTE: Reconciling Generalization with Specialization for Visual-Language to Video Knowledge Transfer
About
Transferring visual-language knowledge from large-scale foundation models for video recognition has proved to be effective. To bridge the domain gap, additional parametric modules are added to capture temporal information. However, zero-shot generalization diminishes as the number of specialized parameters increases, forcing existing works to trade off between zero-shot and close-set performance. In this paper, we present MoTE, a novel framework that balances generalization and specialization within one unified model. Our approach tunes a mixture of temporal experts to learn multiple task views with varying degrees of data fitting. To maximally preserve the knowledge of each expert, we propose *Weight Merging Regularization*, which regularizes the merging process of experts in weight space. We further apply temporal feature modulation to regularize the contribution of the temporal feature at test time. We achieve a sound balance between zero-shot and close-set video recognition tasks and obtain state-of-the-art or competitive results on various datasets, including Kinetics-400 & 600, UCF, and HMDB. Code is available at [https://github.com/ZMHH-H/MoTE](https://github.com/ZMHH-H/MoTE).
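The merging of experts in weight space mentioned above can be sketched as a per-parameter weighted average of the experts' parameters. This is only a minimal illustration of weight-space merging in general; the function name, the dict-based parameter representation, and the uniform coefficients are assumptions for the sketch, not MoTE's actual implementation.

```python
# Minimal sketch: merge several "temporal expert" parameter sets into a
# single set of weights by averaging each parameter in weight space.
# Expert structure and coefficient choice are illustrative assumptions.

def merge_expert_weights(expert_state_dicts, coeffs=None):
    """Per-parameter weighted average across experts.

    expert_state_dicts: list of dicts mapping parameter name -> value.
    coeffs: optional merging coefficients; defaults to a uniform average.
    """
    n = len(expert_state_dicts)
    if coeffs is None:
        coeffs = [1.0 / n] * n  # uniform merge over all experts
    merged = {}
    for name in expert_state_dicts[0]:
        merged[name] = sum(c * sd[name]
                           for c, sd in zip(coeffs, expert_state_dicts))
    return merged

# Toy example with scalar "parameters" standing in for weight tensors.
experts = [{"w": 1.0, "b": 0.0}, {"w": 3.0, "b": 2.0}]
merged = merge_expert_weights(experts)
print(merged)  # {'w': 2.0, 'b': 1.0}
```

In practice the same averaging applies element-wise to full weight tensors (e.g. entries of a PyTorch `state_dict`), and the merging coefficients are what a regularizer such as the paper's Weight Merging Regularization would constrain.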
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Video Recognition | HMDB51 | -- | -- | 89 |
| Video Recognition | UCF101 | Top-1 Acc | 93.6 | 64 |
| Video Recognition | SS v2 | Top-1 Acc | 12.2 | 47 |
| Video Recognition | Kinetics-400 (close-set) | Top-1 Acc | 87.2 | 21 |
| Video Recognition | HMDB51 (test) | Accuracy | 61.4 | 19 |
| Zero-Shot Video Recognition | UCF, HMDB, and Kinetics-600 (zero-shot) | HMDB zero-shot Acc | 74.8 | 18 |
| Video Recognition | UCF-101 (test) | Accuracy | 88.7 | 16 |
| Video Recognition | Kinetics-600 (test) | Accuracy | 78.4 | 15 |
| Few-shot video recognition | UCF-101 | Top-1 Acc (K=2) | 88.1 | 13 |
| Few-shot video recognition | HMDB-51 | Top-1 Acc (K=2) | 0.61 | 13 |