Provable Dynamic Fusion for Low-Quality Multimodal Data

About

The inherent challenge of multimodal fusion is to precisely capture cross-modal correlation and to flexibly conduct cross-modal interaction. To fully exploit the value of each modality and mitigate the influence of low-quality multimodal data, dynamic multimodal fusion has emerged as a promising learning paradigm. Despite its widespread use, theoretical justification in this field is still notably lacking. Can we design a provably robust multimodal fusion method? This paper provides a theoretical understanding that answers this question, from the generalization perspective, under one of the most popular multimodal fusion frameworks. We then show that several uncertainty estimation solutions are naturally available to achieve robust multimodal fusion. Finally, we propose a novel multimodal fusion framework termed Quality-aware Multimodal Fusion (QMF), which improves performance in terms of both classification accuracy and model robustness. Extensive experimental results on multiple benchmarks support our findings.
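The idea of dynamic fusion driven by an uncertainty estimate can be illustrated with a small sketch. This is not the paper's QMF implementation: it assumes each modality produces class logits, and it uses the maximum softmax probability as a stand-in confidence score. The function names (`softmax`, `dynamic_fusion`) and the example logits are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def dynamic_fusion(modality_logits):
    """Fuse per-modality logits with per-sample confidence weights.

    Assumption: confidence is approximated by the maximum softmax
    probability of each modality; other uncertainty estimators could be
    substituted here.
    """
    confidences = [max(softmax(logits)) for logits in modality_logits]
    total = sum(confidences)
    weights = [c / total for c in confidences]  # weights sum to 1

    n_classes = len(modality_logits[0])
    fused = [
        sum(w * logits[k] for w, logits in zip(weights, modality_logits))
        for k in range(n_classes)
    ]
    return fused, weights


# Hypothetical sample: one confident modality, one noisy (near-uniform) modality.
clean_audio = [4.0, 0.5, 0.2]
noisy_image = [0.3, 0.4, 0.2]
fused, weights = dynamic_fusion([clean_audio, noisy_image])
predicted_class = fused.index(max(fused))
```

Because the weights are recomputed for every sample, a modality that is degraded on one input contributes less to that prediction without being down-weighted globally.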

Qingyang Zhang, Haitao Wu, Changqing Zhang, Qinghua Hu, Huazhu Fu, Joey Tianyi Zhou, Xi Peng • 2023

Related benchmarks

Task	Dataset	Metric	Value	
Multimodal Emotion Recognition	IEMOCAP (test)	Accuracy	76.17	162
Emotion Recognition	IEMOCAP	Accuracy	72.08	151
Audio-Image-Text Classification	IEMOCAP (test)	Accuracy	76.17	116
Audio-Visual Classification	CREMA-D (test)	Accuracy	63.71	60
Multimodal Classification	KS (test)	Accuracy	65.78	48
Multimodal Classification	MVSA (test)	Accuracy (%)	77.96	48
Multimodal Multiclass Classification	Food-101 (test)	Accuracy	92.87	45
Multimodal Classification	BRCA (train test)	Accuracy	82.5	36
Multimodal Classification	FOOD101 UPMC (train test)	Accuracy	91.7	36
Multimodal Classification	ROSMAP (train test)	Accuracy	78.3	36

Showing 10 of 39 rows
