
A Theory of Multimodal Learning

About

Human perception of the empirical world involves recognizing the diverse appearances, or 'modalities', of underlying objects. Despite the longstanding consideration of this perspective in philosophy and cognitive science, the study of multimodality remains relatively under-explored within machine learning. Moreover, current studies of multimodal machine learning are largely empirical, lacking theoretical foundations beyond heuristic arguments. An intriguing finding from the practice of multimodal learning is that a model trained on multiple modalities can outperform a finely tuned unimodal model, even on unimodal tasks. This paper provides a theoretical framework that explains this phenomenon by studying the generalization properties of multimodal learning algorithms. We demonstrate that multimodal learning allows for a superior generalization bound compared to unimodal learning, up to a factor of $O(\sqrt{n})$, where $n$ represents the sample size. Such an advantage occurs when both connection and heterogeneity exist between the modalities.
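To make the stated $O(\sqrt{n})$ factor concrete, here is a hedged illustration; the specific decay rates below are assumptions chosen for exposition, not bounds quoted from the paper:

```latex
% Illustrative (assumed) rates; the paper's exact bounds may differ.
% Unimodal generalization gap:   \epsilon_{\mathrm{uni}}(n)   = O(1/\sqrt{n})
% Multimodal generalization gap: \epsilon_{\mathrm{multi}}(n) = O(1/n)
\[
\frac{\epsilon_{\mathrm{uni}}(n)}{\epsilon_{\mathrm{multi}}(n)}
  = \frac{O\!\left(1/\sqrt{n}\right)}{O\!\left(1/n\right)}
  = O\!\left(\sqrt{n}\right)
\]
```

Under these assumed rates, the multimodal bound improves on the unimodal one by exactly the $O(\sqrt{n})$ factor claimed in the abstract.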

Zhou Lu • 2023

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Audio-Visual Event Localization | AVE (test) | Accuracy | 67.41 | 54 |
| Multimodal Classification | Kinetics-Sounds (test) | Multimodal Accuracy | 59.83 | 30 |
| Multimodal Classification | CREMA-D | Accuracy | 60.26 | 28 |
| Audio-Visual Event Classification | VGGSound (test) | Fusion Top-1 Acc | 60.8 | 23 |
| Multimodal Classification | UR-FUNNY | Accuracy | 63.1 | 21 |
| Sentiment analysis and emotion recognition | CMU-MOSEI (test) | Inference Time (s) | 0.279 | 5 |
