Fusion or Confusion? Multimodal Complexity Is Not All You Need
About
Deep learning architectures for multimodal learning have increased in complexity, driven by the assumption that multimodal-specific methods improve performance. We challenge this assumption through a large-scale empirical study reimplementing 19 high-impact methods under standardized conditions. We evaluate them across nine diverse datasets with up to 23 modalities, and test their generalizability to new tasks beyond their original scope, including settings with missing modalities. We propose a Simple Baseline for Multimodal Learning (SimBaMM), a late-fusion Transformer architecture, and demonstrate that under standardized experimental conditions with rigorous hyperparameter tuning of all methods, more complex architectures do not reliably outperform SimBaMM. Statistical analyses show that complex methods perform on par with SimBaMM and often fail to consistently outperform well-tuned unimodal baselines, especially in small-data settings. To support our findings, we include a case study highlighting common methodological shortcomings in the literature, followed by a pragmatic reliability checklist to promote comparable, robust, and trustworthy future evaluations. In summary, we argue for a shift in focus: away from the pursuit of architectural novelty and toward methodological rigor.
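To make the baseline concrete, below is a minimal PyTorch sketch of what a late-fusion Transformer in the spirit of SimBaMM might look like: each modality is encoded independently and fused only at the classification head. The class name, dimensions, mean pooling, and concat-then-linear fusion are illustrative assumptions, not the paper's released implementation.

```python
# Hypothetical sketch of a late-fusion Transformer baseline (SimBaMM-style).
# Assumptions: per-modality Transformer encoders, mean pooling over time,
# and concatenation as the late-fusion step. Not the authors' actual code.
import torch
import torch.nn as nn


class LateFusionBaseline(nn.Module):
    def __init__(self, modality_dims, d_model=128, n_heads=4, n_layers=2, n_classes=7):
        super().__init__()
        # One projection and one Transformer encoder per modality; there is
        # no cross-modal interaction until the final classifier (late fusion).
        self.projections = nn.ModuleList(nn.Linear(d, d_model) for d in modality_dims)
        self.encoders = nn.ModuleList(
            nn.TransformerEncoder(
                nn.TransformerEncoderLayer(
                    d_model=d_model, nhead=n_heads, batch_first=True
                ),
                num_layers=n_layers,
            )
            for _ in modality_dims
        )
        self.classifier = nn.Linear(d_model * len(modality_dims), n_classes)

    def forward(self, inputs):
        # inputs: one (batch, seq_len, feat_dim) tensor per modality.
        pooled = []
        for x, proj, enc in zip(inputs, self.projections, self.encoders):
            h = enc(proj(x))              # (batch, seq_len, d_model)
            pooled.append(h.mean(dim=1))  # mean-pool over the sequence
        # Late fusion: concatenate per-modality embeddings, then classify.
        return self.classifier(torch.cat(pooled, dim=-1))


if __name__ == "__main__":
    # Toy example with three modalities (e.g., text/audio/vision features).
    model = LateFusionBaseline(modality_dims=[300, 74, 35], n_classes=7)
    batch = [torch.randn(8, 20, d) for d in (300, 74, 35)]
    print(model(batch).shape)  # torch.Size([8, 7])
```

The point of the sketch is what it omits: no cross-modal attention, no modality-specific fusion modules, just independent encoders and a linear head over their concatenated outputs.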
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Emotion Recognition | MOSEI | Accuracy (7-class) | 49.36 | 26 |
| Emotion Recognition | MOSI | Accuracy (7-class) | 32.29 | 26 |
| Emotion Recognition | CH-SIMS 2 | Accuracy (5-class) | 43.51 | 26 |
| Emotion Recognition | CH-SIMS | Accuracy (5-class) | 52.76 | 26 |
| Multimodal Classification | Symile | AUROC | 0.6429 | 24 |
| Multimodal Classification | HAIM | AUROC | 0.6985 | 24 |
| Emotion Recognition | CREMA-D | Accuracy (6-class) | 67.23 | 23 |
| Multimodal Classification | INSPECT | AUROC | 0.6556 | 22 |
| Multimodal Classification | UKB | AUROC | 0.7957 | 21 |