Fusion or Confusion? Multimodal Complexity Is Not All You Need
About
Deep learning architectures for multimodal learning have increased in complexity, driven by the assumption that multimodal-specific methods improve performance. We challenge this assumption through a large-scale empirical study reimplementing 19 high-impact methods under standardized conditions. We evaluate them across nine diverse datasets with up to 23 modalities, and test their generalizability to new tasks beyond their original scope, including settings with missing modalities. We propose a Simple Baseline for Multimodal Learning (SimBaMM), a late-fusion Transformer architecture, and demonstrate that under standardized experimental conditions with rigorous hyperparameter tuning of all methods, more complex architectures do not reliably outperform SimBaMM. Statistical analyses show that complex methods perform on par with SimBaMM and often fail to consistently outperform well-tuned unimodal baselines, especially in small-data settings. To support our findings, we include a case study highlighting common methodological shortcomings in the literature, followed by a pragmatic reliability checklist to promote comparable, robust, and trustworthy future evaluations. In summary, we argue for a shift in focus: away from the pursuit of architectural novelty and toward methodological rigor.
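To make the baseline concrete, below is a minimal PyTorch sketch of what a late-fusion Transformer in the spirit of SimBaMM might look like: each modality is encoded independently and fused only at the classification head. The class name, dimensions, mean pooling, and concat-then-linear fusion are illustrative assumptions, not the paper's released implementation.

```python
# Hypothetical sketch of a late-fusion Transformer baseline (SimBaMM-style).
# Assumptions: per-modality Transformer encoders, mean pooling over time,
# and concatenation as the late-fusion step. Not the authors' actual code.
import torch
import torch.nn as nn


class LateFusionBaseline(nn.Module):
    def __init__(self, modality_dims, d_model=128, n_heads=4, n_layers=2, n_classes=7):
        super().__init__()
        # One projection and one Transformer encoder per modality; there is
        # no cross-modal interaction until the final classifier (late fusion).
        self.projections = nn.ModuleList(nn.Linear(d, d_model) for d in modality_dims)
        self.encoders = nn.ModuleList(
            nn.TransformerEncoder(
                nn.TransformerEncoderLayer(
                    d_model=d_model, nhead=n_heads, batch_first=True
                ),
                num_layers=n_layers,
            )
            for _ in modality_dims
        )
        self.classifier = nn.Linear(d_model * len(modality_dims), n_classes)

    def forward(self, inputs):
        # inputs: one (batch, seq_len, feat_dim) tensor per modality.
        pooled = []
        for x, proj, enc in zip(inputs, self.projections, self.encoders):
            h = enc(proj(x))              # (batch, seq_len, d_model)
            pooled.append(h.mean(dim=1))  # mean-pool over the sequence
        # Late fusion: concatenate per-modality embeddings, then classify.
        return self.classifier(torch.cat(pooled, dim=-1))


if __name__ == "__main__":
    # Toy example with three modalities (e.g., text/audio/vision features).
    model = LateFusionBaseline(modality_dims=[300, 74, 35], n_classes=7)
    batch = [torch.randn(8, 20, d) for d in (300, 74, 35)]
    print(model(batch).shape)  # torch.Size([8, 7])
```

The point of the sketch is what it omits: no cross-modal attention, no modality-specific fusion modules, just independent encoders and a linear head over their concatenated outputs.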
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Emotion Recognition | MOSEI | Accuracy (7-class) | 49.36 | 26 |
| Emotion Recognition | MOSI | Accuracy (7-class) | 32.29 | 26 |
| Emotion Recognition | CH-SIMS 2 | Accuracy (5-class) | 43.51 | 26 |
| Emotion Recognition | CH-SIMS | Accuracy (5-class) | 52.76 | 26 |
| Multimodal Classification | Symile | AUROC | 0.6429 | 24 |
| Multimodal Classification | HAIM | AUROC | 0.6985 | 24 |
| Emotion Recognition | CREMA-D | Accuracy (6-class) | 67.23 | 23 |
| Multimodal Classification | INSPECT | AUROC | 0.6556 | 22 |
| Multimodal Classification | UKB | AUROC | 0.7957 | 21 |