
MELTR: Meta Loss Transformer for Learning to Fine-tune Video Foundation Models

About

Foundation models have shown outstanding performance and generalization capabilities across domains. Since most studies on foundation models mainly focus on the pretraining phase, a naive strategy to minimize a single task-specific loss is adopted for fine-tuning. However, such fine-tuning methods do not fully leverage other losses that are potentially beneficial for the target task. Therefore, we propose MEta Loss TRansformer (MELTR), a plug-in module that automatically and non-linearly combines various loss functions to aid learning the target task via auxiliary learning. We formulate the auxiliary learning as a bi-level optimization problem and present an efficient optimization algorithm based on Approximate Implicit Differentiation (AID). For evaluation, we apply our framework to various video foundation models (UniVL, Violet and All-in-one), and show significant performance gain on all four downstream tasks: text-to-video retrieval, video question answering, video captioning, and multi-modal sentiment analysis. Our qualitative analyses demonstrate that MELTR adequately "transforms" individual loss functions and "melts" them into an effective unified loss. Code is available at https://github.com/mlvlab/MELTR.
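The bi-level setup described in the abstract can be sketched with a toy example. This is illustrative only and not the authors' implementation: scalar parameters, a simple weighted (linear) loss combiner, and finite-difference gradients stand in for MELTR's transformer-based non-linear combiner and its AID-based hypergradient; all names (`aux_losses`, `target_loss`, etc.) are hypothetical.

```python
# Toy bi-level optimization sketch (illustrative, not MELTR itself).
# Inner loop: model parameter theta minimizes a weighted sum of losses.
# Outer loop: the combining weights are tuned so that the inner solution
# minimizes the downstream target loss (hypergradient via finite differences,
# standing in for Approximate Implicit Differentiation).

def aux_losses(theta):
    # Two hypothetical auxiliary losses with different minimizers.
    return [(theta - 1.0) ** 2, (theta + 2.0) ** 2]

def target_loss(theta):
    # The downstream (target-task) objective, minimized at theta = 1.
    return (theta - 1.0) ** 2

def inner_optimize(weights, theta=0.0, steps=50, lr=0.1):
    # Inner loop: gradient descent on the weighted loss combination.
    eps = 1e-5
    combined = lambda t: sum(w * l for w, l in zip(weights, aux_losses(t)))
    for _ in range(steps):
        grad = (combined(theta + eps) - combined(theta - eps)) / (2 * eps)
        theta -= lr * grad
    return theta

def outer_step(weights, lr=0.05):
    # Outer loop: nudge each weight against the hypergradient of the
    # target loss evaluated at the inner-loop solution.
    eps = 1e-3
    new_weights = []
    for k in range(len(weights)):
        w_plus = list(weights); w_plus[k] += eps
        w_minus = list(weights); w_minus[k] -= eps
        g = (target_loss(inner_optimize(w_plus)) -
             target_loss(inner_optimize(w_minus))) / (2 * eps)
        new_weights.append(max(weights[k] - lr * g, 0.0))  # keep weights >= 0
    return new_weights

weights = [0.5, 0.5]
for _ in range(30):
    weights = outer_step(weights)
theta_star = inner_optimize(weights)
# The outer loop learns to down-weight the unhelpful auxiliary loss,
# so theta_star converges toward 1.0, the target-loss minimizer.
```

In the paper's actual framework, the linear weighting is replaced by a transformer that non-linearly transforms and combines the loss values, and the finite-difference hypergradient is replaced by AID, which approximates the implicit gradient without differentiating through the full inner loop.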

Dohwan Ko, Joonmyung Choi, Hyeong Kyu Choi, Kyoung-Woon On, Byungseok Roh, Hyunwoo J. Kim • 2023

Related benchmarks

Task                           Dataset           Result           Rank
Video Question Answering       MSVD-QA           Accuracy 51.7    340
Multimodal Sentiment Analysis  CMU-MOSI (test)   F1 85.4          238
Video Question Answering       TGIF-QA           --               147
Text-to-Video Retrieval        YouCook2          Recall@10 74.8   117
Video Captioning               YouCook2          --               104
Video Captioning               MSRVTT            CIDEr 52.8       101
Video Captioning               YouCook2 (test)   CIDEr 190        42
Video Level Summarization      YouCook2          METEOR 22.56     21
Video Captioning               MSRVTT (full)     CIDEr 52.77      20
Text-to-Video Retrieval        MSRVTT 9k         R@1 41.3         14

Showing 10 of 11 rows.

Other info

Code: https://github.com/mlvlab/MELTR
