
What Makes Training Multi-Modal Classification Networks Hard?

About

Consider end-to-end training of a multi-modal vs. a single-modal network on a task with multiple input modalities: the multi-modal network receives more information, so it should match or outperform its single-modal counterpart. In our experiments, however, we observe the opposite: the best single-modal network always outperforms the multi-modal network. This observation is consistent across different combinations of modalities and on different tasks and benchmarks. This paper identifies two main causes for this performance drop: first, multi-modal networks are often prone to overfitting due to increased capacity. Second, different modalities overfit and generalize at different rates, so training them jointly with a single optimization strategy is sub-optimal. We address these two problems with a technique we call Gradient Blending, which computes an optimal blend of modalities based on their overfitting behavior. We demonstrate that Gradient Blending outperforms widely-used baselines for avoiding overfitting and achieves state-of-the-art accuracy on various tasks including human action recognition, ego-centric action recognition, and acoustic event detection.
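The core idea of Gradient Blending is to weight each modality's loss by how much it generalizes relative to how much it overfits. A minimal sketch of that weighting step is below; the function name, the two-checkpoint measurement scheme, and the exact form of the overfitting-to-generalization ratio (validation gain divided by the squared growth of the train/val gap) are illustrative assumptions, not the paper's reference implementation.

```python
def gradient_blend_weights(train_t1, val_t1, train_t2, val_t2):
    """Estimate per-modality blending weights from losses measured at
    two training checkpoints (t1 earlier, t2 later).

    For each modality k:
      delta_g = drop in validation loss between checkpoints (generalization)
      delta_o = growth of the train/val gap between checkpoints (overfitting)
    Weight is proportional to delta_g / delta_o**2, then normalized to sum to 1,
    so modalities that overfit quickly get down-weighted.
    """
    raw = {}
    for k in train_t1:
        delta_g = val_t1[k] - val_t2[k]              # validation improvement
        gap_t1 = val_t1[k] - train_t1[k]             # train/val gap, earlier
        gap_t2 = val_t2[k] - train_t2[k]             # train/val gap, later
        delta_o = gap_t2 - gap_t1                    # how much the gap grew
        raw[k] = max(delta_g, 0.0) / (delta_o ** 2 + 1e-8)
    z = sum(raw.values())
    return {k: v / z for k, v in raw.items()}

# Hypothetical numbers: the audio branch improves less on validation and
# overfits faster than the RGB branch, so it should receive a smaller weight.
weights = gradient_blend_weights(
    train_t1={"rgb": 1.0, "audio": 1.0}, val_t1={"rgb": 1.2, "audio": 1.2},
    train_t2={"rgb": 0.5, "audio": 0.5}, val_t2={"rgb": 1.0, "audio": 1.1},
)
```

The blended training loss would then be the weighted sum of the per-modality head losses (plus the joint head), with these weights recomputed periodically as overfitting behavior changes.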

Weiyao Wang, Du Tran, Matt Feiszli · 2019

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Action Recognition | Kinetics-400 | Top-1 Acc | 80.4 | 413 |
| Action Recognition | UCF-101 | Top-1 Acc | 83.09 | 147 |
| Audio Classification | AudioSet 20K | – | – | 128 |
| Text-to-Video Retrieval | YouCook2 | Recall@10 | 31.07 | 117 |
| Audio Classification | ESC50 | Top-1 Acc | 83.5 | 64 |
| Video Action Classification | Kinetics-400 | Top-1 Accuracy | 0.7859 | 48 |
| Sound Classification | AudioSet (evaluation) | mAP | 41.8 | 39 |
| Action Recognition | EPIC-KITCHENS (val) | Verb Top-1 Acc | 59.2 | 36 |
| Acoustic Event Detection | AudioSet (test) | mAP | 0.418 | 34 |
| Action Recognition | EPIC-Kitchens v1 (test s2, unseen) | Actions Top-1 Acc | 26.6 | 32 |

Showing 10 of 40 rows.
