
Fusion or Confusion? Multimodal Complexity Is Not All You Need

About

Deep learning architectures for multimodal learning have grown increasingly complex, driven by the assumption that multimodal-specific methods improve performance. We challenge this assumption through a large-scale empirical study reimplementing 19 high-impact methods under standardized conditions. We evaluate them across nine diverse datasets with up to 23 modalities, and test their generalizability to new tasks beyond their original scope, including settings with missing modalities. We propose a Simple Baseline for Multimodal Learning (SimBaMM), a late-fusion Transformer architecture, and demonstrate that under standardized experimental conditions with rigorous hyperparameter tuning of all methods, more complex architectures do not reliably outperform SimBaMM. Statistical analyses show that complex methods perform on par with SimBaMM and often fail to consistently outperform well-tuned unimodal baselines, especially in small-data settings. To support our findings, we include a case study highlighting common methodological shortcomings in the literature, followed by a pragmatic reliability checklist to promote comparable, robust, and trustworthy future evaluations. In summary, we argue for a shift in focus: away from the pursuit of architectural novelty and toward methodological rigor.
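The late-fusion idea behind the baseline can be illustrated with a minimal sketch: each modality is encoded independently, the resulting embeddings are pooled, and a shared head produces the prediction. This is not the paper's actual SimBaMM implementation (which uses Transformer encoders); the linear encoders, dimensions, and function names below are placeholders for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(x, W):
    """Stand-in unimodal encoder: a single linear projection.
    (A real late-fusion baseline like SimBaMM would use a
    Transformer encoder per modality; this is a placeholder.)"""
    return x @ W

def late_fusion_predict(modalities, encoders, W_head, present=None):
    """Late fusion: encode each available modality separately,
    average the embeddings, then apply a shared classification head.
    `present` masks which modalities are available, which is how
    late fusion naturally handles missing modalities at test time."""
    if present is None:
        present = [True] * len(modalities)
    embs = [encode(x, W)
            for x, W, p in zip(modalities, encoders, present) if p]
    fused = np.mean(embs, axis=0)   # (batch, d_model)
    return fused @ W_head           # (batch, n_classes)

# Toy example: 3 modalities with different input dims, batch of 4.
dims, d_model, n_classes, batch = [16, 8, 32], 12, 7, 4
encoders = [rng.normal(size=(d, d_model)) for d in dims]
W_head = rng.normal(size=(d_model, n_classes))
modalities = [rng.normal(size=(batch, d)) for d in dims]

logits_full = late_fusion_predict(modalities, encoders, W_head)
# Drop the second modality: prediction still works unchanged.
logits_partial = late_fusion_predict(modalities, encoders, W_head,
                                     present=[True, False, True])
print(logits_full.shape, logits_partial.shape)  # (4, 7) (4, 7)
```

Because fusion happens only after per-modality encoding, removing a modality requires no architectural change, one motivation for preferring simple late fusion in missing-modality settings.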

Tillmann Rheude, Roland Eils, Benjamin Wild • 2025

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Emotion Recognition | MOSEI | Accuracy (7-Class) | 49.36 | 26 |
| Emotion Recognition | MOSI | Accuracy (7-Class) | 32.29 | 26 |
| Emotion Recognition | CH-SIMS 2 | Accuracy (5-Class) | 43.51 | 26 |
| Emotion Recognition | CH-SIMS | Accuracy (5-Class) | 52.76 | 26 |
| Multimodal Classification | Symile | AUROC | 0.6429 | 24 |
| Multimodal Classification | HAIM | AUROC | 0.6985 | 24 |
| Emotion Recognition | CREMA-D | Accuracy (6-Class) | 67.23 | 23 |
| Multimodal Classification | INSPECT | AUROC | 65.56 | 22 |
| Multimodal Classification | UKB | AUROC | 0.7957 | 21 |
