MELTR: Meta Loss Transformer for Learning to Fine-tune Video Foundation Models
About
Foundation models have shown outstanding performance and generalization capabilities across domains. Since most studies on foundation models mainly focus on the pretraining phase, a naive strategy of minimizing a single task-specific loss is adopted for fine-tuning. However, such fine-tuning methods do not fully leverage other losses that are potentially beneficial for the target task. Therefore, we propose MEta Loss TRansformer (MELTR), a plug-in module that automatically and non-linearly combines various loss functions to aid learning of the target task via auxiliary learning. We formulate the auxiliary learning as a bi-level optimization problem and present an efficient optimization algorithm based on Approximate Implicit Differentiation (AID). For evaluation, we apply our framework to various video foundation models (UniVL, Violet and All-in-one), and show significant performance gains on all four downstream tasks: text-to-video retrieval, video question answering, video captioning, and multi-modal sentiment analysis. Our qualitative analyses demonstrate that MELTR adequately "transforms" individual loss functions and "melts" them into an effective unified loss. Code is available at https://github.com/mlvlab/MELTR.
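To make the bi-level formulation concrete, here is a minimal toy sketch of the idea: an inner loop updates the model parameter on a learned combination of losses, while an outer loop adjusts the combiner so that the target-task loss shrinks. Everything in it (the quadratic losses, the linear softmax combiner, the one-step hypergradient) is a simplifying assumption for illustration; MELTR itself combines losses non-linearly with a Transformer and computes the outer gradient via AID.

```python
import math

# Minima of the target loss and two auxiliary losses:
# loss_i(theta) = (theta - c_i)^2
CENTERS = [1.0, 2.0, -0.5]

def grads(theta):
    """Gradients of each individual loss at theta."""
    return [2.0 * (theta - c) for c in CENTERS]

def softmax(phi):
    m = max(phi)
    e = [math.exp(p - m) for p in phi]
    s = sum(e)
    return [x / s for x in e]

def train(steps=500, lr_inner=0.1, lr_outer=0.05):
    theta = 0.0            # model parameter (inner variable)
    phi = [0.0, 0.0, 0.0]  # loss-combiner logits (outer variable)
    for _ in range(steps):
        w = softmax(phi)
        g = grads(theta)
        # Inner step: gradient descent on the combined loss sum_i w_i * loss_i.
        theta -= lr_inner * sum(wi * gi for wi, gi in zip(w, g))
        # Outer step: steer the combiner so the *target* loss shrinks.
        g_target = 2.0 * (theta - CENTERS[0])
        # One-step approximation of d(theta)/d(w_i) through the inner update
        # (a crude stand-in for the implicit gradient that AID approximates).
        dtheta_dw = [-lr_inner * gi for gi in grads(theta)]
        # Chain through the softmax Jacobian J_ij = w_i (delta_ij - w_j).
        dot = sum(wj * dj for wj, dj in zip(w, dtheta_dw))
        hyper = [g_target * wi * (di - dot) for wi, di in zip(w, dtheta_dw)]
        phi = [p - lr_outer * h for p, h in zip(phi, hyper)]
    return theta, softmax(phi)
```

Run `train()` and the combiner learns to reweight the auxiliary losses (whose minima pull toward 2.0 and -0.5) so that the model parameter settles near the target optimum at 1.0, which is the essence of "melting" auxiliary losses into one that serves the target task.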
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Video Question Answering | MSVD-QA | Accuracy: 51.7 | 340 |
| Multimodal Sentiment Analysis | CMU-MOSI (test) | F1: 85.4 | 238 |
| Video Question Answering | TGIF-QA | -- | 147 |
| Text-to-Video Retrieval | YouCook2 | Recall@10: 74.8 | 117 |
| Video Captioning | YouCook2 | -- | 104 |
| Video Captioning | MSRVTT | CIDEr: 52.8 | 101 |
| Video Captioning | YouCook2 (test) | CIDEr: 190 | 42 |
| Video Level Summarization | YouCook2 | METEOR: 22.56 | 21 |
| Video Captioning | MSRVTT (full) | CIDEr: 52.77 | 20 |
| Text-to-Video Retrieval | MSRVTT 9k | R@1: 41.3 | 14 |