What Makes Training Multi-Modal Classification Networks Hard?
About
Consider end-to-end training of a multi-modal vs. a single-modal network on a task with multiple input modalities: the multi-modal network receives more information, so it should match or outperform its single-modal counterpart. In our experiments, however, we observe the opposite: the best single-modal network always outperforms the multi-modal network. This observation is consistent across different combinations of modalities and on different tasks and benchmarks. This paper identifies two main causes for this performance drop: first, multi-modal networks are often prone to overfitting due to increased capacity. Second, different modalities overfit and generalize at different rates, so training them jointly with a single optimization strategy is sub-optimal. We address these two problems with a technique we call Gradient Blending, which computes an optimal blend of modalities based on their overfitting behavior. We demonstrate that Gradient Blending outperforms widely-used baselines for avoiding overfitting and achieves state-of-the-art accuracy on various tasks including human action recognition, ego-centric action recognition, and acoustic event detection.
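The abstract says only that Gradient Blending weights modalities by their overfitting behavior, so the following is a minimal sketch of one way such a blend could be computed, not the authors' implementation. It assumes each branch (for example audio, video, or the fused network) has its own loss head, and that train and validation losses are recorded at two checkpoints. The weighting rule used here, gain in validation loss divided by the squared growth of the train/validation gap, is an assumption, and the names `blend_weights` and `blended_loss` are hypothetical.

```python
def blend_weights(train_before, val_before, train_after, val_after, eps=1e-8):
    """Return normalized loss weights, one per branch.

    Each argument maps a branch name to a loss value measured at one of
    two checkpoints ("before" and "after" a stretch of training).
    """
    raw = {}
    for name in train_before:
        # Generalization gain: how much the validation loss improved.
        gain = val_before[name] - val_after[name]
        # Overfitting growth: how much the train/validation gap widened.
        gap_before = val_before[name] - train_before[name]
        gap_after = val_after[name] - train_after[name]
        overfit = gap_after - gap_before
        # Assumed rule: weight is proportional to gain / overfit^2.
        # Negative gains are clipped to zero; a gap that did not widen is
        # clamped to eps, which gives that branch a very large raw weight.
        raw[name] = max(gain, 0.0) / max(overfit, eps) ** 2

    total = sum(raw.values())
    if total == 0.0:
        # No branch improved on validation data: fall back to uniform weights.
        return {name: 1.0 / len(raw) for name in raw}
    return {name: value / total for name, value in raw.items()}


def blended_loss(losses, weights):
    """Combine per-branch losses into one training objective."""
    return sum(weights[name] * losses[name] for name in losses)
```

In this sketch a branch that improves on validation data while its train/validation gap grows slowly receives a larger share of the combined loss, which matches the abstract's point that modalities overfit and generalize at different rates.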
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Action Recognition | Kinetics-400 | Top-1 Acc | 80.4 | 413 |
| Action Recognition | UCF-101 | Top-1 Acc | 83.09 | 147 |
| Audio Classification | AudioSet 20K | -- | -- | 128 |
| Text-to-Video Retrieval | YouCook2 | Recall@10 | 31.07 | 117 |
| Audio Classification | ESC50 | Top-1 Acc | 83.5 | 64 |
| Video Action Classification | Kinetics-400 | Top-1 Accuracy | 0.7859 | 48 |
| Sound classification | AudioSet (evaluation) | mAP | 41.8 | 39 |
| Action Recognition | EPIC-KITCHENS (val) | Verb Top-1 Acc | 59.2 | 36 |
| Acoustic event detection | AudioSet (test) | mAP | 0.418 | 34 |
| Action Recognition | EPIC-Kitchens v1 (test s2 (unseen)) | Actions Top-1 Acc | 26.6 | 32 |