Fusion or Confusion? Multimodal Complexity Is Not All You Need

About

Multimodal learning has become a prominent research area, with the potential of substantial performance gains by combining information across modalities. At the same time, model development has trended toward increasingly complex deep learning architectures, motivated by the assumption that multimodal-specific methods improve performance. We challenge this assumption through a large-scale empirical study by reimplementing 19 high-impact multimodal methods across nine diverse datasets with up to 23 modalities. Under standardized experimental conditions, including hyperparameter tuning, weight initialization, cross-validation, and statistical testing, increased multimodal complexity often yields confusion rather than effective fusion of data modalities. Accordingly, complex multimodal architectures do not reliably outperform unimodal baselines and a Simple Baseline for Multimodal Learning (SimBaMM). Through a focused case study, we further demonstrate concrete methodological shortcomings even in top-tier multimodal learning publications, underscoring the need for standardized evaluation practices. In summary, we argue for a shift in focus for multimodal learning: away from the pursuit of architectural novelty and toward methodological rigor.

Tillmann Rheude, Roland Eils, Benjamin Wild • 2025

Related benchmarks

Task | Dataset | Metric | Result | Rank
--- | --- | --- | --- | ---
Emotion Recognition | MOSEI | Accuracy (7-Class) | 49.36 | 26
Emotion Recognition | MOSI | Accuracy (7-Class) | 32.29 | 26
Emotion Recognition | CH-SIMS 2 | Accuracy (5-Class) | 43.51 | 26
Emotion Recognition | CH-SIMS | Accuracy (5-Class) | 52.76 | 26
Multimodal Classification | Symile | AUROC | 0.6429 | 24
Multimodal Classification | HAIM | AUROC | 0.6985 | 24
Emotion Recognition | CREMA-D | Accuracy (6-Class) | 67.23 | 23
Multimodal Classification | INSPECT | AUROC | 65.56 | 22
Multimodal Classification | UKB | AUROC | 0.7957 | 21